Protein design automation for protein libraries

ABSTRACT

The invention relates to the use of protein design automation (PDA) to generate computationally prescreened secondary libraries of proteins, and to methods and compositions utilizing the libraries.

FIELD OF THE INVENTION

[0001] The invention relates to the use of a variety of computational methods, including protein design automation (PDA), to generate computationally prescreened secondary libraries of proteins, and to methods of making and methods and compositions utilizing the libraries.

BACKGROUND OF THE INVENTION

[0002] Directed molecular evolution can be used to create proteins and enzymes with novel functions and properties. Starting with a known natural protein, several rounds of mutagenesis, functional screening, and propagation of successful sequences are performed. The advantage of this process is that it can be used to rapidly evolve any protein without knowledge of its structure. Several different mutagenesis strategies exist, including point mutagenesis by error-prone PCR, cassette mutagenesis, and DNA shuffling. These techniques have had many successes; however, they are all handicapped by their inability to produce more than a tiny fraction of the potential changes. For example, there are 20⁵⁰⁰ possible amino acid sequences for an average protein approximately 500 amino acids long. Clearly, the mutagenesis and functional screening of so many mutants is impossible; directed evolution provides a very sparse sampling of the possible sequences and hence examines only a small portion of possible improved proteins, typically point mutants or recombinations of existing sequences. By sampling randomly from the vast number of possible sequences, directed evolution is unbiased and broadly applicable, but inherently inefficient because it ignores all structural and biophysical knowledge of proteins.

[0003] In contrast, computational methods can be used to screen enormous sequence libraries (up to 10⁸⁰ in a single calculation), overcoming the key limitation of experimental library screening methods such as directed molecular evolution. There is a wide variety of methods known for generating and evaluating sequences. These include, but are not limited to, sequence profiling (Bowie and Eisenberg, Science 253(5016): 164-70 (1991)); rotamer library selections (Dahiyat and Mayo, Protein Sci 5(5): 895-903 (1996); Dahiyat and Mayo, Science 278(5335): 82-7 (1997); Desjarlais and Handel, Protein Science 4: 2006-2018 (1995); Harbury et al., PNAS USA 92(18): 8408-8412 (1995); Kono et al., Proteins: Structure, Function and Genetics 19: 244-255 (1994); Hellinga and Richards, PNAS USA 91: 5803-5807 (1994)); and residue pair potentials (Jones, Protein Science 3: 567-574 (1994)).

[0004] In particular, U.S. Ser. Nos. 60/061,097, 60/043,464, 60/054,678, 09/127,926 and PCT US98/07254 describe a method termed “Protein Design Automation”, or PDA, that utilizes a number of scoring functions to evaluate sequence stability.

[0005] It is an object of the present invention to provide computational methods for prescreening sequence libraries to generate and select secondary libraries, which can then be made and evaluated experimentally.

SUMMARY OF THE INVENTION

[0006] In accordance with the objects outlined above, the present invention provides methods for generating a secondary library of scaffold protein variants comprising providing a primary library comprising a rank-ordered list of scaffold protein primary variant sequences. A list of primary variant positions in the primary library is then generated, and a plurality of the primary variant positions is then combined to generate a secondary library of secondary sequences.

[0007] In an additional aspect, the invention provides methods for generating a secondary library of scaffold protein variants comprising providing a primary library comprising a rank-ordered list of scaffold protein primary variant sequences, and generating a probability distribution of amino acid residues in a plurality of variant positions. The plurality of the amino acid residues is combined to generate a secondary library of secondary sequences. These sequences may then be optionally synthesized and tested, in a variety of ways, including multiplexing PCR with pooled oligonucleotides, error prone PCR, gene shuffling, etc.

[0008] In a further aspect, the invention provides compositions comprising a plurality of secondary variant proteins or nucleic acids encoding the proteins, wherein the plurality comprises all or a subset of the secondary library. The invention further provides cells comprising the library, particularly mammalian cells.

[0009] In an additional aspect, the invention provides methods for generating a secondary library of scaffold protein variants comprising providing a first library rank-ordered list of scaffold protein primary variants;

[0010] generating a probability distribution of amino acid residues in a plurality of variant positions; and synthesizing a plurality of scaffold protein secondary variants comprising a plurality of the amino acid residues to form a secondary library. At least one of the secondary variants is different from the primary variants.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] FIG. 1 depicts the synthesis of a full-length gene and all possible mutations by PCR. Overlapping oligonucleotides corresponding to the full-length gene (black bar, Step 1) are synthesized, heated and annealed. Addition of Pfu DNA polymerase to the annealed oligonucleotides results in the 5′→3′ synthesis of DNA (Step 2) to produce longer DNA fragments (Step 3). Repeated cycles of heating and annealing (Step 4) result in the production of longer DNA, including some full-length molecules. These can be selected by a second round of PCR using primers (arrowed) corresponding to the ends of the full-length gene (Step 5).

[0012] FIG. 2 depicts the reduction of the dimensionality of sequence space by PDA screening. From left to right, 1: without PDA; 2: without PDA, not counting cysteine, proline, glycine; 3: with PDA using the 1% criterion, modeling free enzyme; 4: with PDA using the 1% criterion, modeling enzyme-substrate complex; 5: with PDA using the 5% criterion, modeling free enzyme; 6: with PDA using the 5% criterion, modeling enzyme-substrate complex.

[0013] FIG. 3 depicts the active site of B. circulans xylanase. Those positions included in the PDA design are shown by their side chain representation. In red are wild type residues (their conformation was allowed to change, but not their amino acid identity). In green are positions whose conformation and identity were allowed to change (to any amino acid except proline, cysteine and glycine).

[0014] FIG. 4 depicts cefotaxime resistance of E. coli expressing wild type (WT) and PDA screened β-lactamase; results are shown for increasing concentrations of cefotaxime.

[0015] FIG. 5 depicts a preferred scheme for synthesizing a library of the invention. The wild-type gene, or any starting gene, such as the gene encoding the global minimum sequence, can be used. Oligonucleotides comprising different amino acids at the different variant positions can be used during PCR using standard primers. This generally requires fewer oligonucleotides and can result in fewer errors.

[0016] FIG. 6 depicts an overlapping extension method. At the top of FIG. 6 is the template DNA showing the locations of the regions to be mutated (black boxes) and the binding sites of the relevant primers (arrows). The primers R1 and R2 represent a pool of primers, each containing a different mutation; as described herein, this may be done using different ratios of primers if desired. The variant position is flanked by regions of homology sufficient to get hybridization. In this example, three separate PCR reactions are done for Step 1. The first reaction contains the template plus oligos F1 and R1. The second reaction contains template plus F2 and R2, and the third contains the template and F3 and R3. The reaction products are shown. In Step 2, the products from Step 1, tube 1 and Step 1, tube 2 are taken. After purification away from the primers, these are added to a fresh PCR reaction together with F1 and R4. During the Denaturation phase of the PCR, the overlapping regions anneal and the second strand is synthesized. The product is then amplified by the outside primers. In Step 3, the purified product from Step 2 is used in a third PCR reaction, together with the product of Step 1, tube 3 and the primers F1 and R3. The final product corresponds to the full length gene and contains the required mutations.

[0017] FIG. 7 depicts a ligation of PCR reaction products to synthesize the libraries of the invention. In this technique, the primers also contain an endonuclease restriction site (RE), either blunt, 5′ overhanging or 3′ overhanging. Three separate PCR reactions are set up for Step 1. The first reaction contains the template plus oligos F1 and R1. The second reaction contains template plus F2 and R2, and the third contains the template and F3 and R3. The reaction products are shown. In Step 2, the products of Step 1 are purified and then digested with the appropriate restriction endonuclease. The digestion products from Step 2, tube 1 and Step 2, tube 2 are ligated together with DNA ligase (Step 3). The products are then amplified in Step 4 using primers F1 and R4. The whole process is then repeated by digesting the amplified products, ligating them to the digested products of Step 2, tube 3, and then amplifying the final product with primers F1 and R3. It would also be possible to ligate all three PCR products from Step 1 together in one reaction, provided the two restriction sites (RE1 and RE2) were different.

[0018] FIG. 8 depicts blunt end ligation of PCR products. In this technique, the primers such as F1 and R1 do not overlap, but they abut. Again three separate PCR reactions are performed. The products from tube 1 and tube 2 are ligated, and then amplified with outside primers F1 and R4. This product is then ligated with the product from Step 1, tube 3. The final products are then amplified with primers F1 and R3.

[0019] FIG. 9 depicts M13 single stranded template production of mutated PCR products. Primer1 and Primer2 (each representing a pool of primers corresponding to desired mutations) are mixed with the M13 template containing the wildtype gene or any starting gene. PCR produces the desired product (11) containing the combinations of the desired mutations incorporated in Primer1 and Primer2. This scheme can be used to produce a gene with mutations, or fragments of a gene with mutations that are then linked together via ligation or PCR, for example.

DETAILED DESCRIPTION OF THE INVENTION

[0020] The present invention is directed to methods of using computational screening of protein sequence libraries (that can comprise up to 10⁸⁰ or more members) to select smaller libraries of protein sequences (that can comprise up to 10¹³ members), that can then be used in a number of ways. For example, the proteins can be actually synthesized and experimentally tested in the desired assay, for improved function and properties. Similarly, the library can be additionally computationally manipulated to create a new library which then itself can be experimentally tested.

[0021] The invention has two broad uses; first, the invention can be used to prescreen libraries based on known scaffold proteins. That is, computational screening for stability (or other properties) may be done on either the entire protein or some subset of residues, as desired and described below. By using computational methods to generate a threshold or cutoff to eliminate disfavored sequences, the percentage of useful variants in a given variant set size can increase, and the required experimental outlay is decreased.

[0022] In addition, the present invention finds use in the screening of random peptide libraries. As is known, signaling pathways in cells often begin with an effector stimulus that leads to a phenotypically describable change in cellular physiology. Despite the key role intracellular signaling pathways play in disease pathogenesis, in most cases, little is understood about a signaling pathway other than the initial stimulus and the ultimate cellular response.

[0023] Historically, signal transduction has been analyzed by biochemistry or genetics. The biochemical approach dissects a pathway in a “stepping-stone” fashion: find a molecule that acts at, or is involved in, one end of the pathway, isolate assayable quantities and then try to determine the next molecule in the pathway, either upstream or downstream of the isolated one. The genetic approach is classically a “shot in the dark”: induce or derive mutants in a signaling pathway and map the locus by genetic crosses or complement the mutation with a cDNA library. Limitations of biochemical approaches include a reliance on a significant amount of pre-existing knowledge about the constituents under study and the need to carry such studies out in vitro, post-mortem. Limitations of purely genetic approaches include the need to first derive and then characterize the pathway before proceeding with identifying and cloning the gene.

[0024] Screening molecular libraries of chemical compounds for drugs that regulate signal systems has led to important discoveries of great clinical significance. Cyclosporin A (CsA) and FK506, for example, were selected in standard pharmaceutical screens for inhibition of T-cell activation. It is noteworthy that while these two drugs bind completely different cellular proteins (cyclophilin and FK506 binding protein (FKBP), respectively), the effect of either drug is virtually the same: profound and specific suppression of T-cell activation, phenotypically observable in T cells as inhibition of mRNA production dependent on transcription factors such as NF-AT and NF-κB. Libraries of small peptides have also been successfully screened in vitro in assays for bioactivity. The literature is replete with examples of small peptides capable of modulating a wide variety of signaling pathways. For example, a peptide derived from the HIV-1 envelope protein has been shown to block the action of cellular calmodulin.

[0025] Accordingly, generation of random or semi-random sequence libraries of proteins and peptides allows for the selection of proteins (including peptides, oligopeptides and polypeptides) with useful properties. The sequences in these experimental libraries can be randomized at specific sites only, or throughout the sequence. The number of sequences that can be searched in these libraries grows exponentially with the number of positions that are randomized. Generally, only up to 10¹² to 10¹⁵ sequences can be contained in a library because of the physical constraints of laboratories (the size of the instruments, the cost of producing large numbers of biopolymers, etc.). Other practical considerations can often limit the size of the libraries to 10⁶ or fewer. These limits are reached for only 10 amino acid positions. Therefore, only a sparse sampling of sequences is possible in the search for improved proteins or peptides in experimental sequence libraries, lowering the chance of success and almost certainly missing desirable candidates. Because of the randomness of the changes in these sequences, most of the candidates in the library are not suitable, resulting in a waste of most of the effort in producing the library.
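The exponential growth described above can be checked with a short calculation. The following is a minimal sketch, assuming full randomization over the 20 naturally occurring amino acids at each position; the function names are illustrative and do not come from the specification.

```python
import math

AMINO_ACIDS = 20  # assumption: the 20 naturally occurring amino acids


def library_size(num_positions, choices_per_position=AMINO_ACIDS):
    """Number of distinct sequences when every position is fully randomized."""
    return choices_per_position ** num_positions


def log10_library_size(num_positions, choices_per_position=AMINO_ACIDS):
    """Base-10 logarithm of the library size, convenient for very large spaces."""
    return num_positions * math.log10(choices_per_position)


# Compare a few randomization depths against the experimental range quoted above.
for n in (5, 10, 12, 500):
    print(f"{n:>3} positions: about 10^{log10_library_size(n):.1f} sequences")
```

Under these assumptions, ten fully randomized positions already give roughly 10¹³ sequences, which falls inside the 10¹² to 10¹⁵ range stated for experimental libraries.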

[0026] However, using the automated protein design techniques outlined below, virtual libraries of protein sequences can be generated that are vastly larger than experimental libraries. Up to 10⁸⁰ candidate sequences can be screened computationally and those that meet design criteria which favor stable and functional proteins can be readily selected. An experimental library consisting of the favorable candidates found in the virtual library screening can then be generated, resulting in a much more efficient use of the experimental library and overcoming the limitations of random protein libraries.

[0027] Two principal benefits come from the virtual library screening: (1) the automated protein design generates a list of sequence candidates that are favored to meet design criteria; it also shows which positions in the sequence are readily changed and which positions are unlikely to change without disrupting protein stability and function. An experimental random library can be generated that is only randomized at the readily changeable, non-disruptive sequence positions. (2) The diversity of amino acids at these positions can be limited to those that the automated design shows are compatible with these positions. Thus, by limiting the number of randomized positions and the number of possibilities at these positions, the number of wasted sequences produced in the experimental library is reduced, thereby increasing the probability of success in finding sequences with useful properties.
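To illustrate the effect of limiting both the randomized positions and the residues allowed at each, the sketch below compares a fully randomized library with a restricted one. The positions and residue sets are hypothetical values chosen for illustration only; they are not results from any design calculation.

```python
from math import prod


def restricted_library_size(allowed):
    """Size of a library in which each position may only take its allowed residues."""
    return prod(len(residues) for residues in allowed.values())


# Hypothetical output of a computational prescreen:
# position index -> residues judged compatible with that position.
allowed = {23: "AVL", 45: "FYW", 67: "DE", 89: "KR"}

full = 20 ** len(allowed)                      # all 20 amino acids at each position
restricted = restricted_library_size(allowed)  # only the compatible residues

print(f"fully randomized: {full} sequences")
print(f"restricted:       {restricted} sequences")
print(f"reduction factor: {full / restricted:.0f}x")
```

In this toy case the same four positions shrink from 160,000 candidate sequences to 36, so a much larger share of the experimental library is spent on sequences the prescreen considers plausible.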

[0028] In addition, by computationally screening very large libraries of mutants, greater diversity of protein sequences can be screened (i.e. a larger sampling of sequence space), leading to greater improvements in protein function. Further, fewer mutants need to be tested experimentally to screen a given library size, reducing the cost and difficulty of protein engineering. By using computational methods to pre-screen a protein library, the computational features of speed and efficiency are combined with the ability of experimental library screening to create new activities in proteins for which appropriate computational models and structure-function relationships are unclear.

[0029] Similarly, novel methods to create secondary libraries derived from very large computational mutant libraries allow the rapid testing of large numbers of computationally designed sequences.

[0030] In addition, as is more fully outlined below, the libraries may be biased in any number of ways, allowing the generation of secondary libraries that vary in their focus; for example, domains, subsets of residues, active or binding sites, surface residues, etc., may all be varied or kept constant as desired.

[0031] In general, as more fully outlined below, the invention can take on a wide variety of configurations. In general, primary libraries, e.g. libraries of all or a subset of possible proteins, are generated computationally. This can be done in a wide variety of ways, including sequence alignments of related proteins, structural alignments, structural prediction models, databases, or (preferably) protein design automation computational analysis. Similarly, primary libraries can be generated via sequence screening using a set of scaffold structures that are created by perturbing the starting structure (using any number of techniques such as molecular dynamics, Monte Carlo analysis) to make changes to the protein (including backbone and sidechain torsion angle changes). Optimal sequences can be selected for each starting structure (or, some set of the top sequences) to make primary libraries.

[0032] Some of these techniques result in the list of sequences in the primary library being “scored”, or “ranked”, on the basis of some particular criteria. In some embodiments, lists of sequences that are generated without ranking can then be ranked using techniques as outlined below.

[0033] In a preferred embodiment, some subset of the primary library is then experimentally generated to form a secondary library. Alternatively, some or all of the primary library members are recombined to form a secondary library, e.g. with new members. Again, this may be done either computationally or experimentally or both.

[0034] Alternatively, once the primary library is generated, it can be manipulated in a variety of ways. In one embodiment, a different type of computational analysis can be done; for example, a new type of ranking may be done. Alternatively, the primary library can be recombined, e.g. residues at different positions mixed to form a new, secondary library. Again, this can be done either computationally or experimentally, or both.
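Computational recombination of a primary library can be sketched as follows. This is an illustrative example only, assuming equal-length sequences and using short hypothetical sequences; it collects the residues observed at each variant position and enumerates every combination, so the secondary library can contain members absent from the primary library.

```python
from itertools import product


def variant_positions(sequences):
    """Map each position showing more than one residue to the sorted residues seen there."""
    columns = zip(*sequences)  # assumes all sequences have the same length
    return {
        i: sorted(set(col))
        for i, col in enumerate(columns)
        if len(set(col)) > 1
    }


def recombine(sequences):
    """Mix residues across variant positions to enumerate a secondary library."""
    template = list(sequences[0])
    variants = variant_positions(sequences)
    positions = sorted(variants)
    library = []
    for combo in product(*(variants[p] for p in positions)):
        seq = template[:]
        for p, residue in zip(positions, combo):
            seq[p] = residue
        library.append("".join(seq))
    return library


# Hypothetical primary library of three sequences (not real design output).
primary = ["MKVLA", "MKILA", "MRVLA"]
secondary = recombine(primary)
new_members = [s for s in secondary if s not in primary]
print(secondary)
print(new_members)
```

Here the three primary sequences vary at two positions with two residues each, so recombination yields four sequences, one of which does not occur in the primary list.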

[0035] Accordingly, the present invention provides methods for generating secondary libraries of scaffold protein variants. By “protein” herein is meant at least two amino acids linked together by a peptide bond. As used herein, protein includes proteins, oligopeptides and peptides. The peptidyl group may comprise naturally occurring amino acids and peptide bonds, or synthetic peptidomimetic structures, i.e. “analogs”, such as peptoids (see Simon et al., PNAS USA 89(20):9367 (1992)). The amino acids may either be naturally occurring or non-naturally occurring; as will be appreciated by those in the art, any structure for which a set of rotamers is known or can be generated can be used as an amino acid. The side chains may be in either the (R) or the (S) configuration. In a preferred embodiment, the amino acids are in the (S) or L-configuration.

[0036] The scaffold protein may be any protein for which a three dimensional structure is known or can be generated; that is, for which there are three dimensional coordinates for each atom of the protein. Generally this can be determined using X-ray crystallographic techniques, NMR techniques, de novo modelling, homology modelling, etc. In general, if X-ray structures are used, structures at 2 Å resolution or better are preferred, but not required.

[0037] The scaffold proteins may be from any organism, including prokaryotes and eukaryotes, with enzymes from bacteria, fungi, extremophiles such as the archaebacteria, insects, fish, animals (particularly mammals and particularly human) and birds all possible.

[0038] Thus, by “scaffold protein” herein is meant a protein for which a secondary library of variants is desired. As will be appreciated by those in the art, any number of scaffold proteins find use in the present invention. Specifically included within the definition of “protein” are fragments and domains of known proteins, including functional domains such as enzymatic domains, binding domains, etc., and smaller fragments, such as turns, loops, etc. That is, portions of proteins may be used as well. In addition, “protein” as used herein includes proteins, oligopeptides and peptides. In addition, protein variants, i.e. non-naturally occurring protein analog structures, may be used.

[0039] Suitable proteins include, but are not limited to, industrial and pharmaceutical proteins, including ligands, cell surface receptors, antigens, antibodies, cytokines, hormones, transcription factors, signaling modules, cytoskeletal proteins and enzymes. Suitable classes of enzymes include, but are not limited to, hydrolases such as proteases, carbohydrases, lipases; isomerases such as racemases, epimerases, tautomerases, or mutases; transferases, kinases, oxidoreductases, and phosphatases. Suitable enzymes are listed in the Swiss-Prot enzyme database. Suitable protein backbones include, but are not limited to, all of those found in the protein data base compiled and serviced by the Research Collaboratory for Structural Bioinformatics (RCSB, formerly the Brookhaven National Lab).

[0040] Specifically, preferred scaffold proteins include, but are not limited to, those with known structures (including variants) including cytokines (IL-1ra (+receptor complex), IL-1 (receptor alone), IL-1a, IL-1b (including variants and/or receptor complex), IL-2, IL-3, IL-4, IL-5, IL-6, IL-8, IL-10, IFN-β, IFN-γ, IFN-α-2a; IFN-α-2B, TNF-α; CD40 ligand (chk), Human Obesity Protein Leptin, Granulocyte Colony-Stimulating Factor, Bone Morphogenetic Protein-7, Ciliary Neurotrophic Factor, Granulocyte-Macrophage Colony-Stimulating Factor, Monocyte Chemoattractant Protein 1, Macrophage Migration Inhibitory Factor, Human Glycosylation-Inhibiting Factor, Human Rantes, Human Macrophage Inflammatory Protein 1 Beta, human growth hormone, Leukemia Inhibitory Factor, Human Melanoma Growth Stimulatory Activity, neutrophil activating peptide-2, Cc-Chemokine Mcp-3, Platelet Factor M2, Neutrophil Activating Peptide 2, Eotaxin, Stromal Cell-Derived Factor-1, Insulin, Insulin-like Growth Factor I, Insulin-like Growth Factor II, Transforming Growth Factor B1, Transforming Growth Factor B2, Transforming Growth Factor B3, Transforming Growth Factor A, Vascular Endothelial growth factor (VEGF), acidic Fibroblast growth factor, basic Fibroblast growth factor, Endothelial growth factor, Nerve growth factor, Brain Derived Neurotrophic Factor, Ciliary Neurotrophic Factor, Platelet Derived Growth Factor, Human Hepatocyte Growth Factor, Glial Cell-Derived Neurotrophic Factor, (as well as the 55 cytokines in PDB 1/12/99)); Erythropoietin; other extracellular signalling moieties, including, but not limited to, hedgehog Sonic, hedgehog Desert, hedgehog Indian, hCG; coagulation factors including, but not limited to, TPA and Factor VIIa; transcription factors, including but not limited to, p53, p53 tetramerization domain, Zn fingers (of which more than 12 have structures), homeodomains (of which 8 have structures), leucine zippers (of which 4 have structures); antibodies, including, but not limited to, cFv; viral proteins, including, but not limited to, hemagglutinin trimerization domain and HIV Gp41 ectodomain (fusion domain); intracellular signalling modules, including, but not limited to, SH2 domains (of which 8 structures are known), SH3 domains (of which 11 have structures), and Pleckstrin Homology Domains; receptors, including, but not limited to, the extracellular Region Of Human Tissue Factor, Cytokine-Binding Region Of Gp130, G-CSF receptor, erythropoietin receptor, Fibroblast Growth Factor receptor, TNF receptor, IL-1 receptor, IL-1 receptor/IL1ra complex, IL-4 receptor, IFN-γ receptor alpha chain, MHC Class I, MHC Class II, T Cell Receptor, Insulin receptor, insulin receptor tyrosine kinase and human growth hormone receptor.

[0041] Once a scaffold protein is chosen, a primary library is generated using computational processing. Generally speaking, in some embodiments, the goal of the computational processing is to determine a set of optimized protein sequences. By “optimized protein sequence” herein is meant a sequence that best fits the mathematical equations of the computational process. As will be appreciated by those in the art, a global optimized sequence is the one sequence that best fits the equations (for example, when PDA is used, the global optimized sequence is the sequence that best fits Equation 1, below); i.e. the sequence that has the lowest energy of any possible sequence. However, there are any number of sequences that are not the global minimum but that have low energies.

[0042] Thus, a “primary library” as used herein is a collection of optimized sequences, generally, but not always, in the form of a rank-ordered list. In theory, all possible sequences of a protein may be ranked; however, currently 10¹³ sequences is a practical limit. Thus, in general, some subset of all possible sequences is used as the primary library; generally, the top 10³ to 10¹³ sequences are chosen as the primary library. The cutoff for inclusion in the rank ordered list of the primary library can be set in a variety of ways. For example, the cutoff may be just an arbitrary exclusion point: the top 10⁵ sequences may comprise the primary library. Alternatively, all sequences scoring within a certain limit of the global optimum can be used; for example, all sequences within 10 kcal/mol of the global optimum could be used as the primary library. This method has the advantage of using a direct measure of fidelity to a three dimensional structure to determine inclusion. This approach can be used to ensure that library mutations are not limited to positions that have the lowest energy gap between different mutations. Alternatively, the cutoff may be enforced when a predetermined number of mutations per position is reached. As a rank ordered sequence list is lengthened and the library is enlarged, more mutations per position are defined. Alternatively, the total number of sequences defined by the recombination of all mutations can be used as a cutoff criterion for the primary sequence library. Preferred values for the total number of sequences range from 100 to 10²⁰, particularly preferred values range from 1000 to 10¹³, and especially preferred values range from 1000 to 10⁷. Alternatively, the first occurrence in the list of predefined undesirable residues can be used as a cutoff criterion. For example, the first hydrophilic residue occurring in a core position would limit the list. It should also be noted that while these methods are described in conjunction with limiting the size of the primary library, these same techniques may be used to formulate the cutoff for inclusion in the secondary library as well.
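Several of the cutoff criteria described above can be expressed as simple filters over a rank-ordered list. The sketch below is illustrative only: it assumes the list is a Python list of (sequence, energy) pairs sorted from lowest to highest energy, and the sequences, energies, core positions and residue sets are hypothetical.

```python
from math import prod


def top_n(ranked, n):
    """Arbitrary exclusion point: keep the n best-ranked sequences."""
    return ranked[:n]


def within_energy(ranked, window):
    """Keep sequences scoring within `window` (e.g. kcal/mol) of the global optimum."""
    best = ranked[0][1]
    return [(seq, e) for seq, e in ranked if e - best <= window]


def until_undesirable(ranked, core_positions, undesirable):
    """Stop at the first sequence placing an undesirable residue in a core position."""
    kept = []
    for seq, e in ranked:
        if any(seq[p] in undesirable for p in core_positions):
            break
        kept.append((seq, e))
    return kept


def until_combinatorial_size(ranked, max_size):
    """Lengthen the list until recombining all observed mutations would exceed max_size."""
    kept, seen = [], None
    for seq, e in ranked:
        trial = [{r} for r in seq] if seen is None else [s | {r} for s, r in zip(seen, seq)]
        if prod(len(s) for s in trial) > max_size:
            break
        seen = trial
        kept.append((seq, e))
    return kept


# Hypothetical rank-ordered list, lowest (best) energy first.
ranked = [
    ("MKVLA", -120.0),
    ("MKILA", -118.5),
    ("MRVLA", -112.0),
    ("MKDLA", -109.0),
    ("MKVFA", -95.0),
]
HYDROPHILIC = set("DEKRNQSTH")  # assumption: a simple hydrophilic residue set

print(len(top_n(ranked, 2)))
print(len(within_energy(ranked, 10.0)))
print(len(until_undesirable(ranked, core_positions=[2], undesirable=HYDROPHILIC)))
print(len(until_combinatorial_size(ranked, max_size=4)))
```

Each function returns a prefix or subset of the ranked list, so the same filters could equally be applied when choosing members of a secondary library.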

[0043] Thus, the present invention provides methods to generate a primary library optionally comprising a rank ordered list of sequences, generally in terms of theoretical quantitative stability, as is more fully described below. Generating a primary library to optimize the stability of a conformation can be used to stabilize the active site transition state conformation of an enzyme, which will improve its activity. Similarly, stabilizing a ligand-receptor complex or enzyme-substrate complex will improve the binding affinity.

[0044] The primary libraries can be generated in a variety of ways. In essence, any methods that can result in either the relative ranking of the possible sequences of a protein based on measurable stability parameters, or a list of suitable sequences, can be used. As will be appreciated by those in the art, any of the methods described herein or known in the art may be used alone, or in combination with other methods.

[0045] Generally, there are a variety of computational methods that can be used to generate a primary library. In a preferred embodiment, sequence based methods are used. Alternatively, structure based methods, such as PDA, described in detail below, are used.

[0046] In a preferred embodiment, the scaffold protein is an enzyme and highly accurate electrostatic models can be used for enzyme active site residue scoring to improve enzyme active site libraries (see Warshel, Computer Modeling of Chemical Reactions in Enzymes and Solutions, Wiley & Sons, New York, (1991), hereby expressly incorporated by reference). These accurate models can assess the relative energies of sequences with high precision, but are computationally intensive.

[0047] Similarly, molecular dynamics calculations can be used to computationally screen sequences by individually calculating mutant sequence scores and compiling a rank ordered list.

[0048] In a preferred embodiment, residue pair potentials can be used to score sequences (Miyazawa et al., Macromolecules 18(3):534-552 (1985), expressly incorporated by reference) during computational screening.

[0049] In a preferred embodiment, sequence profile scores (Bowie et al., Science 253(5016):164-70 (1991), incorporated by reference) and/or potentials of mean force (Hendlich et al., J. Mol. Biol. 216(1):167-180 (1990), also incorporated by reference) can also be calculated to score sequences. These methods assess the match between a sequence and a 3D protein structure and hence can act to screen for fidelity to the protein structure. By using different scoring functions to rank sequences, different regions of sequence space can be sampled in the computational screen.

[0050] Furthermore, scoring functions can be used to screen for sequences that would create metal or co-factor binding sites in the protein (Hellinga, Fold Des. 3(1):R1-8 (1998), hereby expressly incorporated by reference). Similarly, scoring functions can be used to screen for sequences that would create disulfide bonds in the protein. These potentials attempt to specifically modify a protein structure to introduce a new structural motif.

[0051] In a preferred embodiment, sequence and/or structural alignment programs can be used to generate primary libraries. As is known in the art, there are a number of sequence-based alignment programs, including, for example, Smith-Waterman searches, Needleman-Wunsch, Double Affine Smith-Waterman, frame search, Gribskov/GCG profile search, Gribskov/GCG profile scan, profile frame search, Bucher generalized profiles, Hidden Markov models, Hframe, Double Frame, Blast, Psi-Blast, Clustal, and GeneWise.

[0052] The source of the sequences can vary widely, and includes taking sequences from one or more of the known databases, including, but not limited to, SCOP (Hubbard, et al., Nucleic Acids Res 27(1):254-256 (1999)); PFAM (Bateman, et al., Nucleic Acids Res 27(1):260-262 (1999)); VAST (Gibrat, et al., Curr Opin Struct Biol 6(3):377-385 (1996)); CATH (Orengo, et al., Structure 5(8):1093-1108 (1997)); PhD Predictor (http://www.embl-heidelberg.de/predictprotein/predictprotein.html); Prosite (Hofmann, et al., Nucleic Acids Res 27(1):215-219 (1999)); PIR (http://www.mips.biochem.mpg.de/proj/protseqdb/); GenBank (http://www.ncbi.nlm.nih.gov/); PDB (www.rcsb.org) and BIND (Bader, et al., Nucleic Acids Res 29(1):242-245 (2001)).

[0053] In addition, sequences from these databases can be subjected to contiguous analysis or gene prediction; see Wheeler, et al., Nucleic Acids Res 28(1):10-14 (2000) and Burge and Karlin, J. Mol Biol 268(1):78-94 (1997).

[0054] As is known in the art, there are a number of sequence alignment methodologies that can be used. For example, sequence homology based alignment methods can be used to create sequence alignments of proteins related to the target structure (Altschul et al., J. Mol. Biol. 215(3):403 (1990), incorporated by reference). These sequence alignments are then examined to determine the observed sequence variations. These sequence variations are tabulated to define a primary library. In addition, as is further outlined below, these methods can also be used to generate secondary libraries.

[0055] Sequence based alignments can be used in a variety of ways. For example, a number of related proteins can be aligned, as is known in the art, and the “variable” and “conserved” residues defined; that is, the residues that vary or remain identical between the family members can be defined. These results can be used to generate a probability table, as outlined below. Similarly, these sequence variations can be tabulated and a secondary library defined from them as defined below. Alternatively, the allowed sequence variations can be used to define the amino acids considered at each position during the computational screening. Another variation is to bias the score for amino acids that occur in the sequence alignment, thereby increasing the likelihood that they are found during computational screening but still allowing consideration of other amino acids. This bias would result in a focused primary library but would not eliminate from consideration amino acids not found in the alignment. In addition, a number of other types of bias may be introduced. For example, diversity may be forced; that is, a “conserved” residue is chosen and altered to force diversity on the protein and thus sample a greater portion of the sequence space. Alternatively, the positions of high variability between family members (i.e. low conservation) can be randomized, either using all or a subset of amino acids. Similarly, outlier residues, either positional outliers or side chain outliers, may be eliminated.
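By way of illustration only, the tabulation of observed sequence variations described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the invention: the function names `position_frequencies` and `classify_positions`, the gap character `-`, and the toy alignment are all hypothetical, and the sequences are assumed to be pre-aligned to equal length.

```python
from collections import Counter


def position_frequencies(alignment):
    """Return, for each alignment column, a dict of residue -> observed frequency.

    Gap characters ("-") are ignored when counting (an assumption of this sketch).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    table = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        table.append({res: n / total for res, n in counts.items()} if total else {})
    return table


def classify_positions(table):
    """Label a column "conserved" if exactly one residue type is observed, else "variable"."""
    return ["conserved" if len(freqs) == 1 else "variable" for freqs in table]


# Hypothetical toy alignment of four family members.
alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]
table = position_frequencies(alignment)
labels = classify_positions(table)
```

In this sketch the residues observed in a column could serve as the amino acids considered at that position, and the frequencies could serve as one possible form of per-position probability table.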

[0056] Similarly, structural alignment of structurally related proteins can be done to generate sequence alignments. There are a wide variety of such structural alignment programs known. See for example VAST from the NCBI (http://www.ncbi.nlm.nih.gov:80/Structure/VAST/vast.shtml); SSAP (Orengo and Taylor, Methods Enzymol 266:617-635 (1996)); SARF2 (Alexandrov, Protein Eng 9(9):727-732 (1996)); CE (Shindyalov and Bourne, Protein Eng 11(9):739-747 (1998)); (Orengo et al., Structure 5(8):1093-108 (1997)); Dali (Holm et al., Nucleic Acid Res. 26(1):316-9 (1998)), all incorporated by reference. These structurally-generated sequence alignments can then be examined to determine the observed sequence variations.

[0057] Primary libraries can be generated by predicting secondary structure from sequence, and then selecting sequences that are compatible with the predicted secondary structure. There are a number of secondary structure prediction methods, including, but not limited to, threading (Bryant and Altschul, Curr Opin Struct Biol 5(2):236-244 (1995)); Profile 3D (Bowie, et al., Methods Enzymol 266:598-616 (1996)); MONSSTER (Skolnick, et al., J Mol Biol 265(2):217-241 (1997)); Rosetta (Simons, et al., Proteins 37(S3):171-176 (1999)); PSI-BLAST (Altschul and Koonin, Trends Biochem Sci 23(11):444-447 (1998)); Impala (Schaffer, et al., Bioinformatics 15(12):1000-1011 (1999)); HMMER (McClure, et al., Proc Int Conf Intell Syst Mol Biol 4:155-164 (1996)); Clustal W (http://www.ebi.ac.uk/clustalw/); BLAST (Altschul, et al., J Mol Biol 215(3):403-410 (1990)); helix-coil transition theory (Munoz and Serrano, Biopolymers 41:495, 1997); neural networks; local structure alignment; and others (e.g., see Selbig et al., Bioinformatics 15:1039, 1999).

[0058] Similarly, as outlined above, other computational methods are known, including, but not limited to, sequence profiling (Bowie and Eisenberg, Science 253(5016): 164-70, (1991)); rotamer library selections (Dahiyat and Mayo, Protein Sci 5(5): 895-903 (1996); Dahiyat and Mayo, Science 278(5335): 82-7 (1997); Desjarlais and Handel, Protein Science 4: 2006-2018 (1995); Harbury et al, PNAS USA 92(18): 8408-8412 (1995); Kono et al., Proteins: Structure, Function and Genetics 19: 244-255 (1994); Hellinga and Richards, PNAS USA 91: 5803-5807 (1994)); residue pair potentials (Jones, Protein Science 3: 567-574, (1994)); PROSA (Hendlich et al., J. Mol. Biol. 216:167-180 (1990)); THREADER (Jones et al., Nature 358:86-89 (1992)); and other inverse folding methods such as those described by Simons et al. (Proteins, 34:535-543, 1999), Levitt and Gerstein (PNAS USA, 95:5913-5920, 1998), Godzik and Skolnick (PNAS USA, 89:12098-102, 1992), Godzik et al. (J. Mol. Biol. 227:227-38, 1992) and two profile methods (Gribskov et al. PNAS 84:4355-4358 (1987) and Fischer and Eisenberg, Protein Sci. 5:947-955 (1996), Rice and Eisenberg J. Mol. Biol. 267:1026-1038 (1997)), all of which are expressly incorporated by reference. In addition, other computational methods such as those described by Koehl and Levitt (J. Mol. Biol. 293:1161-1181 (1999); J. Mol. Biol. 293:1183-1193 (1999); expressly incorporated by reference) can be used to create a protein sequence library which can optionally then be used to generate a smaller secondary library for use in experimental screening for improved properties and function.

[0059] In addition, there are computational methods based on forcefield calculations, such as SCMF, that can be used as well; for SCMF, see Delarue et al., Pac. Symp. Biocomput. 109-21 (1997); Koehl et al., J. Mol. Biol. 239:249 (1994); Koehl et al., Nat. Struc. Biol. 2:163 (1995); Koehl et al., Curr. Opin. Struct. Biol. 6:222 (1996); Koehl et al., J. Mol. Biol. 293:1183 (1999); Koehl et al., J. Mol. Biol. 293:1161 (1999); Lee, J. Mol. Biol. 236:918 (1994); and Vasquez, Biopolymers 36:53-70 (1995); all of which are expressly incorporated by reference. Other forcefield calculations that can be used to optimize the conformation of a sequence within a computational method, or to generate de novo optimized sequences as outlined herein include, but are not limited to, OPLS-AA (Jorgensen, et al., J. Am. Chem. Soc. (1996), v 118, pp 11225-11236; Jorgensen, W. L.; BOSS, Version 4.1; Yale University: New Haven, Conn. (1999)); OPLS (Jorgensen, et al., J. Am. Chem. Soc. (1988), v 110, pp 1657ff; Jorgensen, et al., J. Am. Chem. Soc. (1990), v 112, pp 4768ff); UNRES (United Residue Forcefield; Liwo, et al., Protein Science (1993), v 2, pp 1697-1714; Liwo, et al., Protein Science (1993), v 2, pp 1715-1731; Liwo, et al., J. Comp. Chem. (1997), v 18, pp 849-873; Liwo, et al., J. Comp. Chem. (1997), v 18, pp 874-884; Liwo, et al., J. Comp. Chem. (1998), v 19, pp 259-276); Forcefield for Protein Structure Prediction (Liwo, et al., Proc. Natl. Acad. Sci. USA (1999), v 96, pp 5482-5485); ECEPP/3 (Liwo et al., J Protein Chem 1994 May;13(4):375-80); AMBER 1.1 force field (Weiner, et al., J. Am. Chem. Soc. v 106, pp 765-784); AMBER 3.0 force field (U. C. Singh et al., Proc. Natl. Acad. Sci. USA. 82:755-759); CHARMM and CHARMM22 (Brooks, et al., J. Comp. Chem. v 4, pp 187-217); cvff3.0 (Dauber-Osguthorpe, et al., (1988) Proteins: Structure, Function and Genetics, v 4, pp 31-47); cff91 (Maple, et al., J. Comp. Chem. v 15, 162-182); also, the DISCOVER (cvff and cff91) and AMBER forcefields are used in the INSIGHT molecular modeling package (Biosym/MSI, San Diego Calif.) and CHARMM is used in the QUANTA molecular modeling package (Biosym/MSI, San Diego Calif.), all of which are expressly incorporated by reference. In fact, as is outlined below, these forcefield methods may be used to generate the secondary library directly; that is, no primary library is generated; rather, these methods can be used to generate a probability table from which the secondary library is directly generated, for example by using these forcefields during an SCMF calculation.

[0060] In a preferred embodiment, the computational method used to generate the primary library is Protein Design Automation (PDA), as is described in U.S. Ser. Nos. 60/061,097, 60/043,464, 60/054,678, 09/127,926 and PCT US98/07254, all of which are expressly incorporated herein by reference. Briefly, PDA can be described as follows. A known protein structure is used as the starting point. The residues to be optimized are then identified, which may be the entire sequence or subset(s) thereof. The side chains of any positions to be varied are then removed. The resulting structure consisting of the protein backbone and the remaining sidechains is called the template. Each variable residue position is then preferably classified as a core residue, a surface residue, or a boundary residue; each classification defines a subset of possible amino acid residues for the position (for example, core residues generally will be selected from the set of hydrophobic residues, surface residues generally will be selected from the hydrophilic residues, and boundary residues may be either). Each amino acid can be represented by a discrete set of all allowed conformers of each side chain, called rotamers. Thus, to arrive at an optimal sequence for a backbone, all possible sequences of rotamers must be screened, where each backbone position can be occupied either by each amino acid in all its possible rotameric states, or a subset of amino acids, and thus a subset of rotamers.

[0061] Two sets of interactions are then calculated for each rotamer at every position: the interaction of the rotamer side chain with all or part of the backbone (the “singles” energy, also called the rotamer/template or rotamer/backbone energy), and the interaction of the rotamer side chain with all other possible rotamers at every other position or a subset of the other positions (the “doubles” energy, also called the rotamer/rotamer energy). The energy of each of these interactions is calculated through the use of a variety of scoring functions, which include the energy of van der Waals forces, the energy of hydrogen bonding, the energy of secondary structure propensity, the energy of surface area solvation and the electrostatics. Thus, the total energy of each rotamer interaction, both with the backbone and other rotamers, is calculated, and stored in a matrix form.

[0062] The discrete nature of rotamer sets allows a simple calculation of the number of rotamer sequences to be tested. A backbone of length n with m possible rotamers per position will have $m^{n}$ possible rotamer sequences, a number which grows exponentially with sequence length and renders the calculations either unwieldy or impossible in real time. Accordingly, to solve this combinatorial search problem, a “Dead End Elimination” (DEE) calculation is performed. The DEE calculation is based on the fact that if the worst total interaction of a first rotamer is still better than the best total interaction of a second rotamer, then the second rotamer cannot be part of the global optimum solution. Since the energies of all rotamers have already been calculated, the DEE approach only requires sums over the sequence length to test and eliminate rotamers, which speeds up the calculations considerably. DEE can be rerun comparing pairs of rotamers, or combinations of rotamers, which will eventually result in the determination of a single sequence which represents the global optimum energy.
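By way of illustration only, the elimination test described above can be sketched for single rotamers as follows. This is a minimal sketch and not the implementation of the incorporated applications: the data layout (`singles[i][r]` for rotamer/template energies and `doubles[(i, j)][r][u]` for rotamer/rotamer energies with i < j), the function names, and the toy energies are all assumptions made for the example. A rotamer is eliminated when its best possible total interaction is still worse than the worst possible total interaction of some other rotamer at the same position.

```python
from itertools import product


def pair_energy(doubles, i, r, j, u):
    """Look up the stored rotamer/rotamer energy regardless of position order."""
    if i < j:
        return doubles[(i, j)][r][u]
    return doubles[(j, i)][u][r]


def dee_sweep(singles, doubles, alive):
    """One sweep of the singles elimination test over every position."""
    n = len(singles)
    survivors = [set(a) for a in alive]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        best, worst = {}, {}
        for r in alive[i]:
            best[r] = singles[i][r] + sum(
                min(pair_energy(doubles, i, r, j, u) for u in alive[j]) for j in others)
            worst[r] = singles[i][r] + sum(
                max(pair_energy(doubles, i, r, j, u) for u in alive[j]) for j in others)
        # A rotamer survives only if its best case is not worse than the
        # lowest worst case found among the rotamers at this position.
        threshold = min(worst.values())
        survivors[i] = {r for r in alive[i] if best[r] <= threshold}
    return survivors


def dee(singles, doubles):
    """Repeat sweeps until no further rotamer can be eliminated."""
    alive = [set(range(len(row))) for row in singles]
    while True:
        survivors = dee_sweep(singles, doubles, alive)
        if survivors == alive:
            return alive
        alive = survivors


# Hypothetical toy problem: 3 rotamers at position 0, 2 rotamers at position 1.
singles = [[0.0, 1.0, 10.0], [0.0, 0.5]]
doubles = {(0, 1): [[-1.0, 0.0], [0.0, -0.5], [1.0, 2.0]]}

remaining = dee(singles, doubles)


def total(seq):
    """Total energy of a rotamer sequence for the two-position toy problem."""
    return singles[0][seq[0]] + singles[1][seq[1]] + doubles[(0, 1)][seq[0]][seq[1]]


# Exhaustive check over all 3 * 2 = 6 rotamer sequences, feasible only for toy sizes.
brute_force_best = min(product(range(3), range(2)), key=total)
```

In this toy case the sweeps converge to a single rotamer per position, which matches the exhaustive search. In general the singles test alone may leave several rotamers per position, which is where the pairwise or higher-order reruns mentioned above come in.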

[0063] Once the global solution has been found, a Monte Carlo search may be done to generate a rank-ordered list of sequences in the neighborhood of the DEE solution. Starting at the DEE solution, random positions are changed to other rotamers, and the new sequence energy is calculated. If the new sequence meets the criteria for acceptance, it is used as a starting point for another jump. After a predetermined number of jumps, a rank-ordered list of sequences is generated. Monte Carlo searching is a sampling technique to explore sequence space around the global minimum or to find new local minima distant in sequence space. As is additionally outlined below, there are other sampling techniques that can be used, including Boltzmann sampling, genetic algorithm techniques and simulated annealing. In addition, for all the sampling techniques, the kinds of jumps allowed can be altered (e.g. random jumps to random residues, biased jumps (to or away from wild-type, for example), jumps to biased residues (to or away from similar residues, for example), etc.). Similarly, for all the sampling techniques, the acceptance criteria of whether a sampling jump is accepted can be altered.
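By way of illustration only, a search of this kind can be sketched as follows. The acceptance rule used here is the Metropolis criterion, which is an assumption of this sketch; the text above leaves the acceptance criteria open. The function names, the `temperature` and `n_jumps` parameters, the fixed random seed, and the toy energy tables are likewise hypothetical.

```python
import math
import random


def sequence_energy(seq, singles, doubles):
    """Sum stored singles energies and doubles energies for one rotamer sequence."""
    energy = sum(singles[i][r] for i, r in enumerate(seq))
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            energy += doubles[(i, j)][seq[i]][seq[j]]
    return energy


def monte_carlo(start, singles, doubles, n_jumps=200, temperature=1.0, seed=0):
    """Random single-position jumps from `start`; returns sequences ranked by energy."""
    rng = random.Random(seed)
    current = tuple(start)
    e_current = sequence_energy(current, singles, doubles)
    seen = {current: e_current}
    for _ in range(n_jumps):
        pos = rng.randrange(len(current))
        new_rotamer = rng.randrange(len(singles[pos]))
        trial = current[:pos] + (new_rotamer,) + current[pos + 1:]
        e_trial = sequence_energy(trial, singles, doubles)
        seen[trial] = e_trial
        delta = e_trial - e_current
        # Metropolis rule (assumed): always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = trial, e_trial
    return sorted(seen.items(), key=lambda item: item[1])


# Hypothetical toy energies; (0, 0) plays the role of the DEE solution.
singles = [[0.0, 1.0, 10.0], [0.0, 0.5]]
doubles = {(0, 1): [[-1.0, 0.0], [0.0, -0.5], [1.0, 2.0]]}
ranked = monte_carlo((0, 0), singles, doubles)
```

Biased jumps or a different acceptance rule could be substituted at the two marked steps (choice of `new_rotamer` and the acceptance test) without changing the rest of the loop.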

[0064] As outlined in U.S. Ser. No. 09/127,926, the protein backbone (comprising (for a naturally occurring protein) the nitrogen, the carbonyl carbon, the α-carbon, and the carbonyl oxygen, along with the direction of the vector from the α-carbon to the β-carbon) may be altered prior to the computational analysis, by varying a set of parameters called supersecondary structure parameters.

[0065] Once a protein structure backbone is generated (with alterations, as outlined above) and input into the computer, explicit hydrogens are added if not included within the structure (for example, if the structure was generated by X-ray crystallography, hydrogens must be added). After hydrogen addition, energy minimization of the structure is run, to relax the hydrogens as well as the other atoms, bond angles and bond lengths. In a preferred embodiment, this is done by doing a number of steps of conjugate gradient minimization (Mayo et al., J. Phys. Chem. 94:8897 (1990)) of atomic coordinate positions to minimize the Dreiding force field with no electrostatics. Generally from about 10 to about 250 steps is preferred, with about 50 being most preferred.

[0066] The protein backbone structure contains at least one variable residue position. As is known in the art, the residues, or amino acids, of proteins are generally sequentially numbered starting with the N-terminus of the protein. Thus a protein having a methionine at its N-terminus is said to have a methionine at residue or amino acid position 1, with the next residues as 2, 3, 4, etc. At each position, the wild type (i.e. naturally occurring) protein may have one of at least 20 amino acids, in any number of rotamers. By “variable residue position” herein is meant an amino acid position of the protein to be designed that is not fixed in the design method as a specific residue or rotamer, generally the wild-type residue or rotamer.

[0067] In a preferred embodiment, all of the residue positions of the protein are variable. That is, every amino acid side chain may be altered in the methods of the present invention. This is particularly desirable for smaller proteins, although the present methods allow the design of larger proteins as well. While there is no theoretical limit to the length of the protein which may be designed this way, there is a practical computational limit.

[0068] In an alternate preferred embodiment, only some of the residue positions of the protein are variable, and the remainder are “fixed”, that is, they are identified in the three dimensional structure as being in a set conformation. In some embodiments, a fixed position is left in its original conformation (which may or may not correlate to a specific rotamer of the rotamer library being used). Alternatively, residues may be fixed as a non-wild type residue; for example, when known site-directed mutagenesis techniques have shown that a particular residue is desirable (for example, to eliminate a proteolytic site or alter the substrate specificity of an enzyme), the residue may be fixed as a particular amino acid. Alternatively, the methods of the present invention may be used to evaluate mutations de novo, as is discussed below. In an alternate preferred embodiment, a fixed position may be “floated”; the amino acid at that position is fixed, but different rotamers of that amino acid are tested. In this embodiment, the variable residues may be at least one, or anywhere from 0.1% to 99.9% of the total number of residues. Thus, for example, it may be possible to change only a few (or one) residues, or most of the residues, with all possibilities in between.

[0069] In a preferred embodiment, residues which can be fixed include, but are not limited to, structurally or biologically functional residues; alternatively, biologically functional residues may specifically not be fixed. For example, residues which are known to be important for biological activity, such as the residues which form the active site of an enzyme, the substrate binding site of an enzyme, the binding site for a binding partner (ligand/receptor, antigen/antibody, etc.), phosphorylation or glycosylation sites which are crucial to biological function, or structurally important residues, such as disulfide bridges, metal binding sites, critical hydrogen bonding residues, residues critical for backbone conformation such as proline or glycine, residues critical for packing interactions, etc. may all be fixed in a conformation or as a single rotamer, or “floated”.

[0070] Similarly, residues which may be chosen as variable residues may be those that confer undesirable biological attributes, such as susceptibility to proteolytic degradation, dimerization or aggregation sites, glycosylation sites which may lead to immune responses, unwanted binding activity, unwanted allostery, undesirable enzyme activity but with a preservation of binding, etc.

[0071] In a preferred embodiment, each variable position is classified as either a core, surface or boundary residue position, although in some cases, as explained below, the variable position may be set to glycine to minimize backbone strain. In addition, as outlined herein, residues need not be classified; they can be chosen as variable and any set of amino acids may be used. Any combination of core, surface and boundary positions can be utilized: core, surface and boundary residues; core and surface residues; core and boundary residues; and surface and boundary residues, as well as core residues alone, surface residues alone, or boundary residues alone.

[0072] The classification of residue positions as core, surface or boundary may be done in several ways, as will be appreciated by those in the art. In a preferred embodiment, the classification is done via a visual scan of the original protein backbone structure, including the side chains, and assigning a classification based on a subjective evaluation of one skilled in the art of protein modelling. Alternatively, a preferred embodiment utilizes an assessment of the orientation of the Cα-Cβ vectors relative to a solvent accessible surface computed using only the template Cα atoms, as outlined in U.S. Ser. Nos. 60/061,097, 60/043,464, 60/054,678, 09/127,926 and PCT US98/07254. Alternatively, a surface area calculation can be done.

[0073] Once each variable position is classified as either core, surface or boundary, a set of amino acid side chains, and thus a set of rotamers, is assigned to each position. That is, the set of possible amino acid side chains that the program will allow to be considered at any particular position is chosen. Subsequently, once the possible amino acid side chains are chosen, the set of rotamers that will be evaluated at a particular position can be determined. Thus, a core residue will generally be selected from the group of hydrophobic residues consisting of alanine, valine, isoleucine, leucine, phenylalanine, tyrosine, tryptophan, and methionine (in some embodiments, when the α scaling factor of the van der Waals scoring function, described below, is low, methionine is removed from the set), and the rotamer set for each core position potentially includes rotamers for these eight amino acid side chains (all the rotamers if a backbone independent library is used, and subsets if a backbone dependent library is used). Similarly, surface positions are generally selected from the group of hydrophilic residues consisting of alanine, serine, threonine, aspartic acid, asparagine, glutamine, glutamic acid, arginine, lysine and histidine. The rotamer set for each surface position thus includes rotamers for these ten residues. Finally, boundary positions are generally chosen from alanine, serine, threonine, aspartic acid, asparagine, glutamine, glutamic acid, arginine, lysine, histidine, valine, isoleucine, leucine, phenylalanine, tyrosine, tryptophan, and methionine. The rotamer set for each boundary position thus potentially includes every rotamer for these seventeen residues (assuming cysteine, glycine and proline are not used, although they can be). Additionally, in some preferred embodiments, a set of 18 naturally occurring amino acids (all except cysteine and proline, which are known to be particularly disruptive) are used.
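By way of illustration only, the residue sets listed above can be written down as a simple lookup keyed by position classification. The one-letter amino acid codes are standard; the names `CORE`, `SURFACE`, `BOUNDARY`, `allowed_residues` and the `drop_methionine` flag are hypothetical choices made for this sketch.

```python
# Residue sets taken from the lists in the preceding paragraph (one-letter codes).
CORE = frozenset("AVILFYWM")       # Ala, Val, Ile, Leu, Phe, Tyr, Trp, Met
SURFACE = frozenset("ASTDNQERKH")  # Ala, Ser, Thr, Asp, Asn, Gln, Glu, Arg, Lys, His
BOUNDARY = CORE | SURFACE          # union of the two sets: seventeen residues

ALLOWED = {"core": CORE, "surface": SURFACE, "boundary": BOUNDARY}


def allowed_residues(classification, drop_methionine=False):
    """Return the amino acids considered at a position of the given class.

    `drop_methionine` models the optional removal of methionine from the
    core set; the flag itself is an assumption of this sketch.
    """
    try:
        residues = ALLOWED[classification]
    except KeyError:
        raise ValueError(f"unknown classification: {classification!r}") from None
    if classification == "core" and drop_methionine:
        residues = residues - {"M"}
    return residues
```

Restricting each position to one of these sets reduces the number of rotamers, and hence the number of singles and doubles energies, that must be evaluated.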

[0074] Thus, as will be appreciated by those in the art, there is a computational benefit to classifying the residue positions, as it decreases the number of calculations. It should also be noted that there may be situations where the sets of core, boundary and surface residues are altered from those described above; for example, under some circumstances, one or more amino acids is either added or subtracted from the set of allowed amino acids. For example, some proteins which dimerize or multimerize, or have ligand binding sites, may contain hydrophobic surface residues, etc. In addition, residues that do not allow helix “capping” or the favorable interaction with an α-helix dipole may be subtracted from a set of allowed residues. This modification of amino acid groups is done on a residue by residue basis.

[0075] In a preferred embodiment, proline, cysteine and glycine are not included in the list of possible amino acid side chains, and thus the rotamers for these side chains are not used. However, in a preferred embodiment, when the variable residue position has a φ angle (that is, the dihedral angle defined by 1) the carbonyl carbon of the preceding amino acid; 2) the nitrogen atom of the current residue; 3) the α-carbon of the current residue; and 4) the carbonyl carbon of the current residue) greater than 0°, the position is set to glycine to minimize backbone strain.
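By way of illustration only, the φ test described above can be sketched as a dihedral angle computed from the four named backbone atoms. The vector formula is the standard one for a signed dihedral; the function names `dihedral_degrees` and `glycine_override` and the test coordinates are hypothetical and are not taken from any real structure.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond.

    For phi: p0 = carbonyl C of the preceding residue, p1 = N, p2 = C-alpha,
    p3 = carbonyl C of the current residue.
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def glycine_override(phi_degrees):
    """Return "G" when phi is greater than 0 degrees, otherwise None."""
    return "G" if phi_degrees > 0.0 else None


# Hypothetical coordinates chosen to give a dihedral of +90 degrees.
phi = dihedral_degrees((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0))
```

A position for which `glycine_override` returns `"G"` would be set to glycine rather than being assigned a core, surface or boundary residue set.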

[0076] Once the group of potential rotamers is assigned for each variable residue position, processing proceeds as outlined in U.S. Ser. No. 09/127,926 and PCT US98/07254. This processing step entails analyzing interactions of the rotamers with each other and with the protein backbone to generate optimized protein sequences. Simplistically, the processing initially comprises the use of a number of scoring functions to calculate energies of interactions of the rotamers, either to the backbone itself or other rotamers. Preferred PDA scoring functions include, but are not limited to, a van der Waals potential scoring function, a hydrogen bond potential scoring function, an atomic solvation scoring function, a secondary structure propensity scoring function and an electrostatic scoring function. As is further described below, at least one scoring function is used to score each position, although the scoring functions may differ depending on the position classification or other considerations, like favorable interaction with an α-helix dipole. As outlined below, the total energy which is used in the calculations is the sum of the energy of each scoring function used at a particular position, as is generally shown in Equation 1:

$$E_{total} = nE_{vdw} + nE_{as} + nE_{h\text{-}bonding} + nE_{ss} + nE_{elec} \qquad \text{(Equation 1)}$$

[0077] In Equation 1, the total energy is the sum of the energy of the van der Waals potential ($E_{vdw}$), the energy of atomic solvation ($E_{as}$), the energy of hydrogen bonding ($E_{h\text{-}bonding}$), the energy of secondary structure ($E_{ss}$) and the energy of electrostatic interaction ($E_{elec}$). The term n is either 0 or 1, depending on whether the term is to be considered for the particular residue position.
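By way of illustration only, Equation 1 can be evaluated as a switched sum, with each term multiplied by its own 0-or-1 factor n. The term keys, the function name `total_energy`, and the numerical energies are hypothetical values chosen for this sketch.

```python
# Term order follows Equation 1.
TERMS = ("vdw", "as", "h_bonding", "ss", "elec")


def total_energy(energies, switches):
    """Evaluate Equation 1: sum of n * E over the five terms, with n in {0, 1}."""
    total = 0.0
    for term in TERMS:
        n = switches.get(term, 0)
        if n not in (0, 1):
            raise ValueError(f"switch for {term!r} must be 0 or 1, got {n!r}")
        total += n * energies[term]
    return total


# Hypothetical per-term energies for one rotamer at one position.
energies = {"vdw": -2.0, "as": -1.5, "h_bonding": -0.5, "ss": 0.25, "elec": -0.25}

# All terms considered.
all_on = total_energy(energies, dict.fromkeys(TERMS, 1))

# Only van der Waals and atomic solvation considered, as an example of
# position-dependent scoring.
vdw_and_as = total_energy(energies, {"vdw": 1, "as": 1})
```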

[0078] As outlined in U.S. Ser. Nos. 60/061,097, 60/043,464, 60/054,678, 09/127,926 and PCT US98/07254, these scoring functions may be used either alone or in any combination. Once the scoring functions to be used are identified for each variable position, the preferred first step in the computational analysis comprises the determination of the interaction of each possible rotamer with all or part of the remainder of the protein. That is, the energy of interaction, as measured by one or more of the scoring functions, of each possible rotamer at each variable residue position with either the backbone or other rotamers, is calculated. In a preferred embodiment, the interaction of each rotamer with the entire remainder of the protein, i.e. both the entire template and all other rotamers, is done. However, as outlined above, it is possible to only model a portion of a protein, for example a domain of a larger protein, and thus in some cases, not all of the protein need be considered. The term “portion”, as used herein, with regard to a protein refers to a fragment of that protein. This fragment may range in size from 10 amino acid residues to the entire amino acid sequence minus one amino acid. Accordingly, the term “portion”, as used herein, with regard to a nucleic acid refers to a fragment of that nucleic acid. This fragment may range in size from 10 nucleotides to the entire nucleic acid sequence minus one nucleotide.

[0079] In a preferred embodiment, the first step of the computational processing is done by calculating two sets of interactions for each rotamer at every position: the interaction of the rotamer side chain with the template or backbone (the “singles” energy), and the interaction of the rotamer side chain with all other possible rotamers at every other position (the “doubles” energy), whether that position is varied or floated. It should be understood that the backbone in this case includes both the atoms of the protein structure backbone, as well as the atoms of any fixed residues, wherein the fixed residues are defined as a particular conformation of an amino acid.

[0080] Thus, “singles” (rotamer/template) energies are calculated for the interaction of every possible rotamer at every variable residue position with the backbone, using some or all of the scoring functions. Thus, for the hydrogen bonding scoring function, every hydrogen bonding atom of the rotamer and every hydrogen bonding atom of the backbone is evaluated, and the $E_{HB}$ is calculated for each possible rotamer at every variable position. Similarly, for the van der Waals scoring function, every atom of the rotamer is compared to every atom of the template (generally excluding the backbone atoms of its own residue), and the $E_{vdw}$ is calculated for each possible rotamer at every variable residue position. In addition, generally no van der Waals energy is calculated if the atoms are connected by three bonds or less. For the atomic solvation scoring function, the surface of the rotamer is measured against the surface of the template, and the $E_{as}$ for each possible rotamer at every variable residue position is calculated. The secondary structure propensity scoring function is also considered as a singles energy, and thus the total singles energy may contain an $E_{ss}$ term. As will be appreciated by those in the art, many of these energy terms will be close to zero, depending on the physical distance between the rotamer and the template position; that is, the farther apart the two moieties, the smaller the magnitude of the energy.

[0081] For the calculation of “doubles” energy (rotamer/rotamer), the interaction energy of each possible rotamer is compared with every possible rotamer at all other variable residue positions. Thus, “doubles” energies are calculated for the interaction of every possible rotamer at every variable residue position with every possible rotamer at every other variable residue position, using some or all of the scoring functions. Thus, for the hydrogen bonding scoring function, every hydrogen bonding atom of the first rotamer and every hydrogen bonding atom of every possible second rotamer is evaluated, and the $E_{HB}$ is calculated for each possible rotamer pair for any two variable positions. Similarly, for the van der Waals scoring function, every atom of the first rotamer is compared to every atom of every possible second rotamer, and the $E_{vdw}$ is calculated for each possible rotamer pair at every two variable residue positions. For the atomic solvation scoring function, the surface of the first rotamer is measured against the surface of every possible second rotamer, and the $E_{as}$ for each possible rotamer pair at every two variable residue positions is calculated. The secondary structure propensity scoring function need not be run as a “doubles” energy, as it is considered as a component of the “singles” energy. As will be appreciated by those in the art, many of these doubles energy terms will be close to zero, depending on the physical distance between the first rotamer and the second rotamer; that is, the farther apart the two moieties, the smaller the magnitude of the energy.
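By way of illustration only, the all-atom pairwise comparison used for a van der Waals “doubles” energy, and the decay of that energy toward zero with distance, can be sketched as follows. A generic 12-6 Lennard-Jones form is assumed here; it is not asserted to be the van der Waals scoring function of the incorporated applications. The parameters `r0` (minimum-energy separation) and `d0` (well depth), their default values, and the function names are hypothetical.

```python
import math


def vdw_pair(r, r0=3.5, d0=0.1):
    """Generic 12-6 Lennard-Jones energy for one atom pair at separation r.

    The energy equals -d0 at r == r0 and tends to zero as r grows.
    """
    if r <= 0.0:
        raise ValueError("separation must be positive")
    x = (r0 / r) ** 6
    return d0 * (x * x - 2.0 * x)


def rotamer_pair_vdw(atoms_a, atoms_b):
    """Sum the pair energy over every atom of rotamer A against every atom of rotamer B."""
    return sum(vdw_pair(math.dist(a, b)) for a in atoms_a for b in atoms_b)


# Hypothetical one-atom "rotamers" at two different separations.
near = rotamer_pair_vdw([(0.0, 0.0, 0.0)], [(3.5, 0.0, 0.0)])
far = rotamer_pair_vdw([(0.0, 0.0, 0.0)], [(35.0, 0.0, 0.0)])
```

Because distant pairs contribute almost nothing, an implementation could reasonably skip pairs beyond a cutoff distance; any such cutoff would be a further assumption and is not shown here.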

[0082] In addition, as will be appreciated by those in the art, a variety of force fields can be used in the PDA calculations, including, but not limited to, Dreiding I and Dreiding II (Mayo et al, J. Phys. Chem. 94:8897 (1990)), AMBER (Weiner et al., J. Amer. Chem. Soc. 106:765 (1984) and Weiner et al., J. Comp. Chem. 7:230 (1986)), MM2 (Allinger, J. Amer. Chem. Soc. 99:8127 (1977), Liljefors et al., J. Comp. Chem. 8:1051 (1987)); MMP2 (Sprague et al., J. Comp. Chem. 8:581 (1987)); CHARMM (Brooks et al., J. Comp. Chem. 4:187 (1983)); GROMOS; and MM3 (Allinger et al., J. Amer. Chem. Soc. 111:8551 (1989)), OPLS-AA (Jorgensen, et al., J. Am. Chem. Soc. (1996), v 118, pp 11225-11236; Jorgensen, W. L.; BOSS, Version 4.1; Yale University: New Haven, Conn. (1999)); OPLS (Jorgensen, et al., J. Am. Chem. Soc. (1988), v 110, pp 1657ff; Jorgensen, et al., J. Am. Chem. Soc. (1990), v 112, pp 4768ff); UNRES (United Residue Forcefield; Liwo, et al., Protein Science (1993), v 2, pp 1697-1714; Liwo, et al., Protein Science (1993), v 2, pp 1715-1731; Liwo, et al., J. Comp. Chem. (1997), v 18, pp 849-873; Liwo, et al., J. Comp. Chem. (1997), v 18, pp 874-884; Liwo, et al., J. Comp. Chem. (1998), v 19, pp 259-276); Forcefield for Protein Structure Prediction (Liwo, et al., Proc. Natl. Acad. Sci. USA (1999), v 96, pp 5482-5485); ECEPP/3 (Liwo et al., J Protein Chem 1994 May;13(4):375-80); AMBER 1.1 force field (Weiner, et al., J. Am. Chem. Soc. v 106, pp 765-784); AMBER 3.0 force field (U. C. Singh et al., Proc. Natl. Acad. Sci. USA. 82:755-759); CHARMM and CHARMM22 (Brooks, et al., J. Comp. Chem. v 4, pp 187-217); cvff3.0 (Dauber-Osguthorpe, et al., (1988) Proteins: Structure, Function and Genetics, v 4, pp 31-47); cff91 (Maple, et al., J. Comp. Chem. v 15, 162-182); also, the DISCOVER (cvff and cff91) and AMBER forcefields are used in the INSIGHT molecular modeling package (Biosym/MSI, San Diego Calif.) and CHARMM is used in the QUANTA molecular modeling package (Biosym/MSI, San Diego Calif.), all of which are expressly incorporated by reference.

[0083] Once the singles and doubles energies are calculated and stored, the next step of the computational processing may occur. As outlined in U.S. Ser. No. 09/127,926 and PCT US98/07254, preferred embodiments utilize a Dead End Elimination (DEE) step, and preferably a Monte Carlo step.

[0084] PDA, viewed broadly, has three components that may be varied to alter the output (e.g. the primary library): the scoring functions used in the process; the filtering technique; and the sampling technique.

[0085] In a preferred embodiment, the scoring functions may be altered. In a preferred embodiment, the scoring functions outlined above may be biased or weighted in a variety of ways. For example, a bias towards or away from a reference sequence or family of sequences can be applied; for example, a bias towards wild-type or homolog residues may be used. Similarly, the entire protein or a fragment of it may be biased; for example, the active site may be biased towards wild-type residues, or domain residues may be biased towards a particular desired physical property. Furthermore, a bias towards or against increased energy can be generated. Additional scoring function biases include, but are not limited to, applying electrostatic potential gradients or hydrophobicity gradients, adding a substrate or binding partner to the calculation, or biasing towards a desired charge or hydrophobicity.

[0086] In addition, in an alternative embodiment, there are a variety of additional scoring functions that may be used. Additional scoring functions include, but are not limited to, torsional potentials, residue pair potentials, or residue entropy potentials. Such additional scoring functions can be used alone, or as functions for processing the library after it is scored initially. For example, a variety of functions derived from data on binding of peptides to MHC (Major Histocompatibility Complex) can be used to rescore a library in order to eliminate proteins containing sequences which can potentially bind to MHC, i.e. potentially immunogenic sequences.

[0087] In a preferred embodiment, a variety of filtering techniques can be used, including, but not limited to, DEE and its related counterparts. Additional filtering techniques include, but are not limited to, branch-and-bound techniques for finding optimal sequences (Gordon and Mayo, Structure Fold. Des. 7:1089-98, 1999), and exhaustive enumeration of sequences. It should be noted, however, that some techniques may also be done without any filtering techniques; for example, sampling techniques can be used to find good sequences, in the absence of filtering.

[0088] As will be appreciated by those in the art, once an optimized sequence or set of sequences is generated (or again, these need not be optimized or ordered), a variety of sequence space sampling methods can be used, either in addition to the preferred Monte Carlo methods, or instead of a Monte Carlo search. That is, once a sequence or set of sequences is generated, preferred methods utilize sampling techniques to allow the generation of additional, related sequences for testing.

[0089] These sampling methods can include the use of amino acid substitutions, insertions or deletions, or recombinations of one or more sequences. As outlined herein, a preferred embodiment utilizes a Monte Carlo search, which is a series of biased, systematic, or random jumps. However, there are other sampling techniques that can be used, including Boltzmann sampling, genetic algorithm techniques and simulated annealing. In addition, for all the sampling techniques, the kinds of jumps allowed can be altered (e.g. random jumps to random residues, biased jumps (to or away from wild-type, for example), jumps to biased residues (to or away from similar residues, for example), etc.). Jumps where multiple residue positions are coupled (two residues always change together, or never change together) and jumps where whole sets of residues change to other sequences (e.g., recombination) can also be used. Similarly, for all the sampling techniques, the acceptance criteria of whether a sampling jump is accepted can be altered, to allow broad searches at high temperature and narrow searches close to local optima at low temperatures. See Metropolis et al., J. Chem. Phys. v 21, pp 1087, 1953, hereby expressly incorporated by reference.
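By way of illustration only, the temperature-dependent acceptance of jumps described above can be sketched as follows. This is a minimal sketch of a Metropolis-style criterion, not the implementation used in PDA; the function names, the single-residue jump type, the combined temperature unit (energy units, i.e. kT) and the toy energy function in the usage note are all assumptions made for the example.

```python
import math
import random

def metropolis_accept(delta_energy, temperature, rng=random):
    """Decide whether to accept a jump that changes the energy by delta_energy."""
    if delta_energy <= 0:
        return True  # jumps that lower (or keep) the energy are always accepted
    # Uphill jumps are accepted with probability exp(-dE / T); a higher
    # temperature accepts more of them and therefore searches more broadly.
    return rng.random() < math.exp(-delta_energy / temperature)

def monte_carlo_search(start, energy_fn, alphabet, temperature, steps, seed=0):
    """Random single-residue jumps; returns visited sequences ranked by energy."""
    rng = random.Random(seed)
    current = list(start)
    current_e = energy_fn(current)
    visited = {"".join(current): current_e}
    for _ in range(steps):
        trial = list(current)
        pos = rng.randrange(len(trial))      # hypothetical jump: one random position
        trial[pos] = rng.choice(alphabet)    # ... set to one random residue
        trial_e = energy_fn(trial)
        if metropolis_accept(trial_e - current_e, temperature, rng):
            current, current_e = trial, trial_e
            visited["".join(current)] = current_e
    # Lowest energy first, i.e. a rank-ordered list of the accepted sequences.
    return sorted(visited.items(), key=lambda item: item[1])
```

For example, `monte_carlo_search("LLLL", lambda s: sum(c != "A" for c in s), "AL", 0.5, 200)` runs the search with a toy energy that simply counts non-alanine residues; a real calculation would substitute the stored singles and doubles energies.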

[0090] In addition, it should be noted that the preferred methods of the invention result in a rank ordered list of sequences; that is, the sequences are ranked on the basis of some objective criteria. However, as outlined herein, it is possible to create a set of non-ordered sequences, for example by generating a probability table directly (for example using SCMF analysis or sequence alignment techniques) that lists sequences without ranking them. The sampling techniques outlined herein can be used in either situation.

[0091] In a preferred embodiment, Boltzmann sampling is done. As will be appreciated by those in the art, the temperature criteria for Boltzmann sampling can be altered to allow broad searches at high temperature and narrow searches close to local optima at low temperatures (see e.g., Metropolis et al., J. Chem. Phys. 21:1087, 1953).

[0092] In a preferred embodiment, the sampling technique utilizes genetic algorithms, such as those described by Holland (Adaptation in Natural and Artificial Systems, 1975, Ann Arbor, U. Michigan Press). Genetic algorithm analysis generally takes generated sequences and recombines them computationally, similar to a nucleic acid recombination event, in a manner similar to “gene shuffling”. Thus the “jumps” of genetic algorithm analysis generally are multiple position jumps. In addition, as outlined below, correlated multiple jumps may also be done. Such jumps can occur with different crossover positions and more than one recombination at a time, and can involve recombination of two or more sequences. Furthermore, deletions or insertions (random or biased) can be done. In addition, as outlined below, genetic algorithm analysis may also be used after the secondary library has been generated.
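The recombination “jump” of a genetic algorithm can be sketched as a crossover between two equal-length sequences. The sketch below is illustrative only; the function name `recombine` and its parameters are hypothetical, and it handles only two parents without insertions or deletions.

```python
import random

def recombine(parent_a, parent_b, n_crossovers=1, rng=None):
    """Build a child by alternating segments of two aligned parent sequences."""
    rng = rng or random.Random()
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have equal length")
    # Choose distinct crossover points strictly inside the sequence.
    points = sorted(rng.sample(range(1, len(parent_a)), n_crossovers))
    sources = (parent_a, parent_b)
    child, which, start = [], 0, 0
    for point in points + [len(parent_a)]:
        child.append(sources[which][start:point])  # copy segment from current parent
        which, start = 1 - which, point            # switch parent at each crossover
    return "".join(child)
```

With `n_crossovers=2`, the child begins and ends with residues from the first parent and carries a block of the second parent in between, so several positions change in one jump.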

[0093] In a preferred embodiment, the sampling technique utilizes simulated annealing, such as described by Kirkpatrick et al. (Science, 220:671-680, 1983). Simulated annealing alters the cutoff for accepting good or bad jumps by altering the temperature. That is, the stringency of the cutoff is altered by altering the temperature. This allows broad searches at high temperature to new areas of sequence space, alternating with narrow searches at low temperature to explore regions in detail.

[0094] In addition, as outlined below, these sampling methods can be used to further process a secondary library to generate additional secondary libraries (sometimes referred to herein as tertiary libraries).

[0095] Thus, the primary library can be generated in a variety of computational ways, including structure based methods such as PDA, or sequence based methods, or combinations as outlined herein.

[0096] Accordingly, the computational processing results in a set of sequences, that may be optimized protein sequences if some sort of ranking or scoring functions are used. These optimized protein sequences are generally, but not always, significantly different from the wild-type sequence from which the backbone was taken. That is, each optimized protein sequence preferably comprises at least about 5-10% variant amino acids from the starting or wild-type sequence, with at least about 15-20% changes being preferred and at least about 30% changes being particularly preferred.

[0097] The cutoff for the primary library is then enforced, resulting in a set of primary sequences forming the primary library. As outlined above, this may be done in a variety of ways, including an arbitrary cutoff, an energy limitation, or when a certain number of residue positions have been varied. In general, the size of the primary library will vary with the size of the protein, the number of residues that are changing, the computational methods used, the cutoff applied and the discretion of the user. In general, it is preferable to have the primary library be large enough to randomly sample a reasonable sequence space to allow for robust secondary libraries. Thus, primary libraries that range from about 50 to about 10¹³ are preferred, with from about 1000 to about 10⁷ being particularly preferred, and from about 1000 to about 100,000 being especially preferred.

[0098] In a preferred embodiment when scoring is used, although this is not required, the primary library comprises the globally optimal sequence in its optimal conformation, i.e. the optimum rotamer at each variable position. That is, computational processing is run until the simulation program converges on a single sequence which is the global optimum. In a preferred embodiment, the primary library comprises at least two optimized protein sequences. Thus for example, the computational processing step may eliminate a number of disfavored combinations but be stopped prior to convergence, providing a library of sequences of which the global optimum is one. In addition, further computational analysis, for example using a different method, may be run on the library, to further eliminate sequences or rank them differently. Alternatively, as is more fully described in U.S. Ser. Nos. 60/061,097, 60/043,464, 60/054,678, 09/127,926 and PCT US98/07254, the global optimum may be reached, and then further computational processing may occur, which generates additional optimized sequences in the neighborhood of the global optimum.

[0099] In addition, in some embodiments, primary library sequences that did not make the cutoff are included in the primary library. This may be desirable in some situations to evaluate the primary library generation method, to serve as controls or comparisons, or to sample additional sequence space. For example, in a preferred embodiment, the wild-type sequence is included.

[0100] It should also be noted that different ranking systems can be used. For example, a list of naturally occurring sequences can be used to calculate all possible recombinations of these sequences, with an optional rank ordering step. Alternatively, once a primary library is generated, one could rank order only those recombinations that occur at cross-over points with at least a threshold of identity over a given window. For example, 100% identity over a window of 6 amino acids, or 80% identity over a window of 10 amino acids. Alternatively, as for all the systems outlined herein, the homology could be considered at the DNA level, by computationally considering the translation for the amino acids to their respective DNA codons. Different codon usages could be considered. A preferred embodiment considers only recombinations with crossover points that have DNA sequence identity sufficient for DNA hybridization of the differing sequences.
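The identity-window criterion for cross-over points can be illustrated with a short sketch. It assumes two pre-aligned amino acid sequences of equal length and slides a fixed window along them; the function name and parameters are hypothetical, and a DNA-level version would apply the same test to codon sequences.

```python
def allowed_crossovers(seq_a, seq_b, window=6, min_identity=1.0):
    """Return window start positions where the two sequences meet the identity threshold."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    allowed = []
    for i in range(len(seq_a) - window + 1):
        # Count identical residues inside the window starting at position i.
        matches = sum(a == b for a, b in zip(seq_a[i:i + window], seq_b[i:i + window]))
        if matches / window >= min_identity:
            allowed.append(i)
    return allowed
```

Calling `allowed_crossovers(a, b, window=6, min_identity=1.0)` corresponds to the 100% over 6 amino acids example, and `window=10, min_identity=0.8` to the 80% over 10 amino acids example.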

[0101] As is further outlined below, it should also be noted that combining different primary libraries may be done. For example, positions in a protein that show a great deal of mutational diversity in computational screening can be fixed as outlined below and a different primary library regenerated. A rank ordered list of the same length as the first would now show diversity in previously rarely changing positions. The variants from the first primary library can be combined with the variants from the second primary library to provide a combined library at lower computational cost than creating a very long rank ordered list. This approach can be particularly useful to sample sequence diversity in both low energy gap, readily changing surface positions and high energy gap, rarely changing core positions. In addition, primary libraries can be generated by combining one or more of the different calculations to form one big primary library.

[0102] Thus, the present invention provides primary libraries comprising a list of computationally derived sequences. In a preferred embodiment, these sequences are in the form of a rank ordered list. From this primary library, a secondary library is generated. As outlined herein, there are a number of different ways to generate a secondary library.

[0103] In a preferred embodiment, the primary library of the scaffold protein is used to generate a secondary library. As will be appreciated by those in the art, the secondary library can be either a subset of the primary library, or contain new library members, i.e. sequences that are not found in the primary library. That is, in general, the variant positions and/or amino acid residues in the variant positions can be recombined in any number of ways to form a new library that exploits the sequence variations found in the primary library. That is, having identified “hot spots” or important variant positions and/or residues, these positions can be recombined in novel ways to generate novel sequences to form a secondary library. Thus, in a preferred embodiment, the secondary library comprises at least one member sequence that is not found in the primary library, and preferably a plurality of such sequences.

[0104] In one embodiment, all or a portion of the primary library serves as the secondary library. That is, a cutoff is applied to the primary sequences and these sequences serve as the secondary library, without further manipulation or recombination. The library members can be made as outlined below, e.g. by direct synthesis or by constructing the nucleic acids encoding the library members, expressing them in a suitable host, optionally followed by screening.

[0105] In a preferred embodiment, the secondary library is generated by tabulating the amino acid positions that vary from a reference sequence. The reference sequence can be arbitrarily selected, or preferably is chosen either as the wild-type sequence or the global optimum sequence, with the latter being preferred. That is, each amino acid position that varies in the primary library is tabulated. Of course, if the original computational analysis fixed some positions, the variable positions of the secondary library will comprise either just these original variable positions or some subset of these original variable positions. That is, assuming a protein of 100 amino acids, the original computational screen can allow all 100 positions to be varied. However, due to the cutoff in the primary library, only positions may vary. Alternatively, assuming the same 100 amino acid protein, the original computational screen could have varied only 25 positions, keeping the other 75 fixed; this could result in only 12 of the 25 being varied in the cutoff primary library. These primary library positions can then be recombined to form a secondary library, wherein all possible combinations of these variable positions form the secondary library. It should be noted that the non-variable positions are set to the reference sequence positions.
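The tabulation step can be sketched as a comparison of each primary library member against the reference sequence. The sketch below is a minimal illustration with hypothetical names; it uses 0-based positions and includes the reference residue in the set recorded for each variable position.

```python
def tabulate_variable_positions(reference, library):
    """Map each position that differs from the reference to the residues seen there."""
    variable = {}
    for seq in library:
        if len(seq) != len(reference):
            raise ValueError("library members must match the reference length")
        for pos, (ref_res, res) in enumerate(zip(reference, seq)):
            if res != ref_res:
                # Start each set with the reference residue, then add the variant.
                variable.setdefault(pos, {ref_res}).add(res)
    return variable
```

Positions that never differ from the reference do not appear in the result and would stay fixed at the reference residue in the secondary library.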

[0106] The formation of the secondary library using this method may be done in two general ways; either all variable positions are allowed to be any amino acid, or subsets of amino acids are allowed for each position.

[0107] In a preferred embodiment, all amino acid residues are allowed at each variable position identified in the primary library. That is, once the variable positions are identified, a secondary library comprising every combination of every amino acid at each variable position is made.

[0108] In a preferred embodiment, subsets of amino acids are chosen. The subset at any position may be either chosen by the user, or may be a collection of the amino acid residues generated in the primary screen. That is, assuming core residue 25 is variable and the primary screen gives 5 different possible amino acids for this position, the user may choose the set of good core residues outlined above (e.g. hydrophobic residues), or the user may build the set by choosing the 5 different amino acids generated in the primary screen. Alternatively, combinations of these techniques may be used, wherein the set of identified residues is manually expanded. For example, in some embodiments, fewer than the number of amino acid residues is chosen; for example, only three of the five may be chosen. Alternatively, the set is manually expanded; for example, if the computation picks two different hydrophobic residues, additional choices may be added. Similarly, the set may be biased, for example either towards or away from the wild-type sequence, or towards or away from known domains, etc.
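Once a set of allowed residues has been settled for each variable position (all twenty amino acids, or a chosen subset), the combinatorial secondary library can be enumerated directly. The following sketch is illustrative; the function name and the dictionary layout (0-based position mapped to a set of one-letter residues) are assumptions.

```python
from itertools import product

def enumerate_secondary_library(reference, allowed):
    """Yield every combination of the allowed residues at the variable positions."""
    positions = sorted(allowed)
    choices = [sorted(allowed[p]) for p in positions]
    for combo in product(*choices):
        seq = list(reference)            # non-variable positions keep the reference residue
        for pos, res in zip(positions, combo):
            seq[pos] = res
        yield "".join(seq)
```

The library size is the product of the set sizes, so two positions with 2 and 3 allowed residues give 6 members; for larger designs it is usually the size that is computed rather than the full enumeration.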

[0109] In addition, this may be done by analyzing the primary library to determine which amino acid positions in the scaffold protein have a high mutational frequency, and which positions have a low mutation frequency. The secondary library can be generated by randomizing the amino acids at the positions that have high numbers of mutations, while keeping constant the positions that do not have mutations above a certain frequency. For example, if the position has less than 20% and more preferably 10% mutations, it may be kept constant as the reference sequence position.

[0110] In a preferred embodiment, the secondary library is generated from a probability distribution table. As outlined herein, there are a variety of methods of generating a probability distribution table, including using PDA, sequence alignments, forcefield calculations such as SCMF calculations, etc. In addition, the probability distribution can be used to generate information entropy scores for each position, as a measure of the mutational frequency observed in the library.
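One common way to compute an information entropy score for a position is the Shannon entropy of the residue frequencies observed there. The text does not specify the formula, so the choice of Shannon entropy in bits, and the function name below, are assumptions made for this sketch.

```python
import math
from collections import Counter

def position_entropy(library, pos):
    """Shannon entropy (in bits) of the residues observed at one position."""
    counts = Counter(seq[pos] for seq in library)
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        p = n / total                 # observed frequency of this residue
        entropy -= p * math.log2(p)
    return entropy
```

A position that never varies scores 0, and a position split evenly between two residues scores 1 bit, so higher scores flag the more variable positions.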

[0111] In this embodiment, the frequency of each amino acid residue at each variable position in the list is identified. Frequencies can be thresholded, wherein any variant frequency lower than a cutoff is set to zero. This cutoff is preferably 1%, 2%, 5%, 10% or 20%, with 10% being particularly preferred. These frequencies are then built into the secondary library. That is, as above, these variable positions are collected and all possible combinations are generated, but the amino acid residues that “fill” the secondary library are utilized on a frequency basis. Thus, in a non-frequency based secondary library, a variable position that has 5 possible residues will have 20% of the proteins comprising that variable position with the first possible residue, 20% with the second, etc. However, in a frequency based secondary library, a variable position that has 5 possible residues with frequencies of 10%, 15%, 25%, 30% and 20%, respectively, will have 10% of the proteins comprising that variable position with the first possible residue, 15% of the proteins with the second residue, 25% with the third, etc. As will be appreciated by those in the art, the actual frequency may depend on the method used to actually generate the proteins; for example, exact frequencies may be possible when the proteins are synthesized. However, when the frequency-based primer system outlined below is used, the actual frequencies at each position will vary, as outlined below.
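The frequency thresholding step can be sketched as follows. The cutoff of 10% is the value named above as particularly preferred; renormalizing the surviving frequencies so that they sum to one is an assumption of this sketch rather than something the text specifies, as are the function name and parameters.

```python
from collections import Counter

def thresholded_frequencies(library, pos, cutoff=0.10):
    """Residue frequencies at one position, with sub-cutoff variants set to zero."""
    counts = Counter(seq[pos] for seq in library)
    total = sum(counts.values())
    # Drop any residue whose observed frequency falls below the cutoff.
    kept = {res: n / total for res, n in counts.items() if n / total >= cutoff}
    norm = sum(kept.values())
    # Assumption: rescale the remaining frequencies so they sum to 1.
    return {res: f / norm for res, f in kept.items()}
```

The resulting table for each variable position can then drive the proportions in which residues are used to fill the secondary library.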

[0112] As will be appreciated by those in the art and outlined herein, probability distribution tables can be generated in a variety of ways. In addition to the methods outlined herein, self-consistent mean field (SCMF) methods can be used in the direct generation of probability tables. SCMF is a deterministic computational method that uses a mean field description of rotamer interactions to calculate energies. A probability table generated in this way can be used to create secondary libraries as described herein. SCMF can be used in three ways: the frequencies of amino acids and rotamers for each amino acid are listed at each position; the probabilities are determined directly from SCMF (see Delarue et al., Pac. Symp. Biocomput. 109-21 (1997), expressly incorporated by reference). In addition, highly variable positions and non-variable positions can be identified. Alternatively, another method is used to determine what sequence is jumped to during a search of sequence space; SCMF is used to obtain an accurate energy for that sequence; this energy is then used to rank it and create a rank-ordered list of sequences (similar to a Monte Carlo sequence list). A probability table showing the frequencies of amino acids at each position can then be calculated from this list (Koehl et al., J. Mol. Biol. 239:249 (1994); Koehl et al., Nat. Struc. Biol. 2:163 (1995); Koehl et al., Curr. Opin. Struct. Biol. 6:222 (1996); Koehl et al., J. Mol. Biol. 293:1183 (1999); Koehl et al., J. Mol. Biol. 293:1161 (1999); Lee J. Mol. Biol. 236:918 (1994); and Vasquez Biopolymers 36:53-70 (1995); all of which are expressly incorporated by reference). Similar methods include, but are not limited to, OPLS-AA (Jorgensen, et al., J. Am. Chem. Soc. (1996), v 118, pp 11225-11236; Jorgensen, W. L.; BOSS, Version 4.1; Yale University: New Haven, Conn. (1999)); OPLS (Jorgensen, et al., J. Am. Chem. Soc. (1988), v 110, pp 1657ff; Jorgensen, et al., J. Am. Chem. Soc. (1990), v 112, pp 4768ff); UNRES (United Residue Forcefield; Liwo, et al., Protein Science (1993), v 2, pp 1697-1714; Liwo, et al., Protein Science (1993), v 2, pp 1715-1731; Liwo, et al., J. Comp. Chem. (1997), v 18, pp 849-873; Liwo, et al., J. Comp. Chem. (1997), v 18, pp 874-884; Liwo, et al., J. Comp. Chem. (1998), v 19, pp 259-276); Forcefield for Protein Structure Prediction (Liwo, et al., Proc. Natl. Acad. Sci. USA (1999), v 96, pp 5482-5485); ECEPP/3 (Liwo et al., J. Protein Chem. 1994 May;13(4):375-80); AMBER 1.1 force field (Weiner, et al., J. Am. Chem. Soc. v 106, pp 765-784); AMBER 3.0 force field (U. C. Singh et al., Proc. Natl. Acad. Sci. USA. 82:755-759); CHARMM and CHARMM22 (Brooks, et al., J. Comp. Chem. v 4, pp 187-217); cvff3.0 (Dauber-Osguthorpe, et al., (1988) Proteins: Structure, Function and Genetics, v 4, pp 31-47); cff91 (Maple, et al., J. Comp. Chem. v 15, 162-182); also, the DISCOVER (cvff and cff91) and AMBER forcefields are used in the INSIGHT molecular modeling package (Biosym/MSI, San Diego Calif.) and CHARMM is used in the QUANTA molecular modeling package (Biosym/MSI, San Diego Calif.).

[0113] In addition, as outlined herein, a preferred method of generating a probability distribution table is through the use of sequence alignment programs. In addition, the probability table can be obtained by a combination of sequence alignments and computational approaches. For example, one can add amino acids found in the alignment of homologous sequences to the result of the computation. Preferably, one can add the wild type amino acid identity to the probability table if it is not found in the computation.

[0114] As will be appreciated, a secondary library created by recombining variable positions and/or residues at the variable position may not be in a rank-ordered list. In some embodiments, the entire list may just be made and tested. Alternatively, in a preferred embodiment, the secondary library is also in the form of a rank ordered list. This may be done for several reasons, including that the size of the secondary library is still too big to generate experimentally, or for predictive purposes. This may be done in several ways. In one embodiment, the secondary library is ranked using the scoring functions of PDA to rank the library members. Alternatively, statistical methods could be used. For example, the secondary library may be ranked by frequency score; that is, proteins containing the most high frequency residues could be ranked higher, etc. This may be done by adding or multiplying the frequency at each variable position to generate a numerical score. Similarly, different positions of the secondary library could be weighted and then the proteins scored; for example, those containing certain residues could be arbitrarily ranked.
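Ranking by frequency score can be sketched as below, with the score formed either by adding or by multiplying the per-position frequencies. The function names and the layout of `freq_table` (0-based position mapped to a residue-to-frequency dictionary) are hypothetical; residues absent from the table are given a frequency of zero in this sketch.

```python
import math

def frequency_score(seq, freq_table, multiply=False):
    """Combine the frequencies of a sequence's residues at the variable positions."""
    values = [freq_table[pos].get(seq[pos], 0.0) for pos in sorted(freq_table)]
    return math.prod(values) if multiply else sum(values)

def rank_by_frequency(library, freq_table, multiply=False):
    """Return library members ordered from highest to lowest frequency score."""
    return sorted(library,
                  key=lambda s: frequency_score(s, freq_table, multiply),
                  reverse=True)
```

Multiplying penalizes any single rare residue much more strongly than adding does, so the two options can order the same library differently when a member mixes very common and very rare residues.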

[0115] As outlined herein, secondary libraries can be generated in two general ways. The first is computationally, as above, wherein the primary library is further computationally manipulated, for example by recombining the possible variant positions and/or amino acid residues at each variant position or by recombining portions of the sequences containing one or more variant position. It may be ranked, as outlined above. This computationally-derived secondary library can then be experimentally generated by synthesizing the library members or nucleic acids encoding them, as is more fully outlined below. Alternatively, the secondary library is made experimentally; that is, nucleic acid recombination techniques are used to experimentally generate the combinations. This can be done in a variety of ways, as outlined below.

[0116] In a preferred embodiment, the different protein members of the secondary library may be chemically synthesized. This is particularly useful when the designed proteins are short, preferably less than 150 amino acids in length, with less than 100 amino acids being preferred, and less than 50 amino acids being particularly preferred, although as is known in the art, longer proteins can be made chemically or enzymatically. See for example Wilken et al, Curr. Opin. Biotechnol. 9:412-26 (1998), hereby expressly incorporated by reference.

[0117] In a preferred embodiment, particularly for longer proteins or proteins for which large samples are desired, the secondary library sequences are used to create nucleic acids such as DNA which encode the member sequences and which can then be cloned into host cells, expressed and assayed, if desired. Thus, nucleic acids, and particularly DNA, can be made which encode each member protein sequence. This is done using well known procedures. The choice of codons, suitable expression vectors and suitable host cells will vary depending on a number of factors, and can be easily optimized as needed.

[0118] In a preferred embodiment, multiple PCR reactions with pooled oligonucleotides are done, as is generally depicted in FIG. 1. In this embodiment, overlapping oligonucleotides are synthesized which correspond to the full length gene. Again, these oligonucleotides may represent all of the different amino acids at each variant position or subsets.

[0119] In a preferred embodiment, these oligonucleotides are pooled in equal proportions and multiple PCR reactions are performed to create full length sequences containing the combinations of mutations defined by the secondary library. In addition, this may be done using error-prone PCR methods.

[0120] In a preferred embodiment, the different oligonucleotides are added in relative amounts corresponding to the probability distribution table. The multiple PCR reactions thus result in full length sequences with the desired combinations of mutations in the desired proportions.

[0121] The total number of oligonucleotides needed is a function of the number of positions being mutated and the number of mutations being considered at these positions:

[0122] (number of oligos for constant positions) + M1 + M2 + M3 + . . . + Mn = (total number of oligos required), where Mn is the number of mutations considered at position n in the sequence.
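As a worked example of this count, the short sketch below evaluates the sum for hypothetical numbers; the function name and the example values are illustrative only, and the sketch covers just the case where each variable oligonucleotide carries one mutated position.

```python
def total_oligos(n_constant_oligos, mutations_per_position):
    """Total oligos = oligos for constant positions + M1 + M2 + ... + Mn."""
    return n_constant_oligos + sum(mutations_per_position)

# Hypothetical design: 10 constant oligos and three mutated positions
# with 3, 5 and 2 mutations considered, respectively.
example_total = total_oligos(10, [3, 5, 2])
```

Here the total is 10 + 3 + 5 + 2 = 20 oligonucleotides.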

[0123] In a preferred embodiment, each overlapping oligonucleotide comprises only one position to be varied; in alternate embodiments, the variant positions are too close together to allow this and multiple variants per oligonucleotide are used to allow complete recombination of all the possibilities. That is, each oligo can contain the codon for a single position being mutated, or for more than one position being mutated. The multiple positions being mutated must be close in sequence to prevent the oligo length from being impractical. For multiple mutating positions on an oligonucleotide, particular combinations of mutations can be included or excluded in the library by including or excluding the oligonucleotide encoding that combination. For example, as discussed herein, there may be correlations between variable regions; that is, when position X is a certain residue, position Y must (or must not) be a particular residue. These sets of variable positions are sometimes referred to herein as a “cluster”. When the clusters are comprised of residues close together, and thus can reside on one oligonucleotide primer, the clusters can be set to the “good” correlations, eliminating the bad combinations that may decrease the effectiveness of the library. However, if the residues of the cluster are far apart in sequence, and thus will reside on different oligonucleotides for synthesis, it may be desirable to either set the residues to the “good” correlation, or eliminate them as variable residues entirely. In an alternative embodiment, the library may be generated in several steps, so that the cluster mutations only appear together. This procedure, i.e., the procedure of identifying mutation clusters and either placing them on the same oligonucleotides or eliminating them from the library or library generation in several steps preserving clusters, can considerably enrich the experimental library with properly folded protein. Identification of clusters can be carried out in a number of ways, e.g. by using known pattern recognition methods, comparisons of frequencies of occurrence of mutations or by using energy analysis of the sequences to be experimentally generated (for example, if the energy of interaction is high, the positions are correlated). These correlations may be positional correlations (e.g. variable positions 1 and 2 always change together or never change together) or sequence correlations (e.g. if there is a residue A at position 1, there is always residue B at position 2). See: Pattern discovery in Biomolecular Data: Tools, Techniques, and Applications; edited by Jason T. L. Wang, Bruce A. Shapiro, Dennis Shasha. New York: Oxford University, 1999; Andrews, Harry C. Introduction to mathematical techniques in pattern recognition; New York, Wiley-Interscience [1972]; Applications of Pattern Recognition; Editor, K. S. Fu. Boca Raton, Fla. CRC Press, 1982; Genetic Algorithms for Pattern Recognition; edited by Sankar K. Pal, Paul P. Wang. Boca Raton: CRC Press, c1996; Pandya, Abhijit S., Pattern recognition with Neural networks in C++/Abhijit S. Pandya, Robert B. Macy. Boca Raton, Fla.: CRC Press, 1996; Handbook of pattern recognition and computer vision/edited by C. H. Chen, L. F. Pau, P. S. P. Wang. 2nd ed. Singapore; River Edge, N.J.: World Scientific, c1999; Friedman, Introduction to Pattern Recognition: Statistical, Structural, Neural, and Fuzzy Logic Approaches; River Edge, N.J.: World Scientific, c1999, Series title: Series in machine perception and artificial intelligence; vol. 32; all of which are expressly incorporated by reference. In addition, programs used to search for consensus motifs can be used as well.
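A simple frequency-based way to look for positional correlations is to check, for every pair of variable positions, whether the two positions are always mutated together across the library. The sketch below is one minimal illustration of that idea, not one of the cited pattern recognition methods; the function name and 0-based positions are assumptions.

```python
from itertools import combinations

def coupled_positions(reference, library, positions):
    """Return position pairs that are mutated together (or not at all) in every member."""
    coupled = []
    for i, j in combinations(positions, 2):
        changed_i = [seq[i] != reference[i] for seq in library]
        changed_j = [seq[j] != reference[j] for seq in library]
        # Identical change patterns, with at least one actual change,
        # mark the pair as a candidate cluster.
        if changed_i == changed_j and any(changed_i):
            coupled.append((i, j))
    return coupled
```

Pairs reported this way are only candidates; a strict all-or-nothing test is sensitive to a single exception, so a real analysis would more likely use a statistical measure or the interaction energies mentioned above.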

[0124] In addition, correlations and shuffling can be fixed or optimized by altering the design of the oligonucleotides; that is, by deciding where the oligonucleotides (primers) start and stop (e.g. where the sequences are “cut”). The start and stop sites of oligos can be set to maximize the number of clusters that appear in single oligonucleotides, thereby enriching the library with higher scoring sequences. Different oligonucleotide start and stop site options can be computationally modeled and ranked according to the number of clusters that are represented on single oligos, or the percentage of the resulting sequences consistent with the predicted library of sequences.
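Ranking alternative start and stop site options by the number of clusters captured on single oligos can be sketched as follows. The representation is an assumption made for the example: each design is a list of half-open (start, end) residue ranges, one per oligo, overlap regions between neighbouring oligos are ignored, and each cluster is a tuple of 0-based residue positions.

```python
def clusters_on_single_oligos(boundaries, clusters):
    """Count clusters whose positions all fall within one oligo of the design."""
    count = 0
    for cluster in clusters:
        # A cluster is captured if some oligo range contains every one of its positions.
        if any(all(start <= p < end for p in cluster) for start, end in boundaries):
            count += 1
    return count

def rank_designs(designs, clusters):
    """Order candidate oligo designs from most to fewest captured clusters."""
    return sorted(designs,
                  key=lambda b: clusters_on_single_oligos(b, clusters),
                  reverse=True)
```

The alternative criterion named above, the percentage of resulting sequences consistent with the predicted library, would need the library itself as input and is not covered by this sketch.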

[0125] The total number of oligonucleotides required increases when multiple mutable positions are encoded by a single oligonucleotide. The annealed regions are the ones that remain constant, i.e. have the sequence of the reference sequence.

[0126] Oligonucleotides with insertions or deletions of codons can be used to create a library expressing different length proteins. In particular, computational sequence screening for insertions or deletions can result in secondary libraries defining different length proteins, which can be expressed by a library of pooled oligonucleotides of different lengths.

[0127] In a preferred embodiment, the secondary library is made by shuffling the family (e.g. a set of variants); that is, some set of the top sequences (if a rank-ordered list is used) can be shuffled, either with or without error-prone PCR. “Shuffling” in this context means a recombination of related sequences, generally in a random way. It can include “shuffling” as defined and exemplified in U.S. Pat. Nos. 5,830,721; 5,811,238; 5,605,793; 5,837,458 and PCT US/1 9256, all of which are expressly incorporated by reference in their entirety. This set of sequences can also be an artificial set; for example, from a probability table (for example generated using SCMF) or a Monte Carlo set. Similarly, the “family” can be the top 10 and the bottom 10 sequences, the top 100 sequences, etc. This may also be done using error-prone PCR.

[0128] Thus, in a preferred embodiment, in silico shuffling is done using the computational methods described therein. That is, starting with either two libraries or two sequences, random recombinations of the sequences can be generated and evaluated.
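
A minimal sketch of in silico shuffling of two sequences follows, assuming equal-length parents and random crossover points. The function names, the number of crossovers, and the scoring function are assumptions made for illustration; in practice the evaluation step would use a computational scoring method such as those described herein, not the toy score shown.

```python
import random
from typing import Callable, List


def recombine(parent_a: str, parent_b: str,
              n_crossovers: int, rng: random.Random) -> str:
    """Build one child by alternating between parents at random crossover points."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    points = sorted(rng.sample(range(1, len(parent_a)), n_crossovers))
    parents = (parent_a, parent_b)
    current = rng.randrange(2)  # which parent supplies the first segment
    child, prev = [], 0
    for point in points + [len(parent_a)]:
        child.append(parents[current][prev:point])
        current = 1 - current
        prev = point
    return "".join(child)


def shuffle_in_silico(parent_a: str, parent_b: str, n_children: int,
                      score: Callable[[str], float],
                      n_crossovers: int = 2, seed: int = 0) -> List[str]:
    """Generate distinct recombinants and return them sorted by score (lowest first)."""
    rng = random.Random(seed)
    children = {recombine(parent_a, parent_b, n_crossovers, rng)
                for _ in range(n_children)}
    return sorted(children, key=score)


# Toy example: the score is a placeholder (count of 'A' residues), not a real energy.
children = shuffle_in_silico("AAAAAAAA", "LLLLLLLL", n_children=20,
                             score=lambda seq: seq.count("A"))
```

Each child takes every position from one of the two parents, so the recombinants stay within the sequence space spanned by the starting sequences.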

[0129] In a preferred embodiment, error-prone PCR is done to generate the secondary library. See U.S. Pat. Nos. 5,605,793, 5,811,238, and 5,830,721, all of which are hereby incorporated by reference. This can be done on the optimal sequence or on top members of the library, or some other artificial set or family. In this embodiment, the gene for the optimal sequence found in the computational screen of the primary library can be synthesized. Error-prone PCR is then performed on the optimal sequence gene in the presence of oligonucleotides that code for the mutations at the variant positions of the secondary library (bias oligonucleotides). The addition of the oligonucleotides will create a bias favoring the incorporation of the mutations in the secondary library. Alternatively, only oligonucleotides for certain mutations may be used to bias the library.

[0130] In a preferred embodiment, gene shuffling with error-prone PCR can be performed on the gene for the optimal sequence, in the presence of bias oligonucleotides, to create a DNA sequence library that reflects the proportion of the mutations found in the secondary library. The choice of the bias oligonucleotides can be done in a variety of ways; they can be chosen on the basis of their frequency, i.e. oligonucleotides encoding high mutational frequency positions can be used; alternatively, oligonucleotides containing the most variable positions can be used, such that the diversity is increased; if the secondary library is ranked, some number of top scoring positions can be used to generate bias oligonucleotides; random positions may be chosen; a few top scoring and a few low scoring ones may be chosen; etc. What is important is to generate new sequences based on preferred variable positions and sequences.

[0131] In a preferred embodiment, PCR using a wild type gene or other gene can be used, as is schematically depicted in FIG. 5. In this embodiment, a starting gene is used; generally, although this is not required, the gene is the wild type gene. In some cases it may be the gene encoding the global optimized sequence, or any other sequence of the list. In this embodiment, oligonucleotides are used that correspond to the variant positions and contain the different amino acids of the secondary library. PCR is done using PCR primers at the termini, as is known in the art. This provides two benefits; the first is that this generally requires fewer oligonucleotides and can result in fewer errors. In addition, it has experimental advantages in that if the wild type gene is used, it need not be synthesized.

[0132] In a preferred embodiment, a variety of additional steps may be done to one or more secondary libraries; for example, further computational processing can occur, secondary libraries can be recombined, or cutoffs from different secondary libraries can be combined. In a preferred embodiment, a secondary library may be computationally remanipulated to form an additional secondary library (sometimes referred to herein as “tertiary libraries”). For example, any of the secondary library sequences may be chosen for a second round of PDA, by freezing or fixing some or all of the changed positions in the first secondary library. Alternatively, only changes seen in the last probability distribution table are allowed. Alternatively, the stringency of the probability table may be altered, either by increasing or decreasing the cutoff for inclusion. Similarly, the secondary library may be recombined experimentally after the first round; for example, the best gene/genes from the first screen may be taken and gene assembly redone (using techniques outlined below, multiple PCR, error-prone PCR, shuffling, etc.). Alternatively, the fragments from one or more good gene(s) can be used to change probabilities at some positions. This biases the search to an area of sequence space found in the first round of computational and experimental screening.
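
The effect of changing the cutoff for inclusion can be sketched as follows. This is only an illustration under stated assumptions: the probability table is represented as a mapping from position to per-residue probabilities, the values are invented, and the function names are hypothetical.

```python
from math import prod
from typing import Dict, List

ProbabilityTable = Dict[int, Dict[str, float]]  # position -> residue -> probability


def apply_cutoff(table: ProbabilityTable, cutoff: float) -> Dict[int, List[str]]:
    """Keep, at each position, only residues whose probability meets the cutoff."""
    return {pos: sorted(res for res, p in residues.items() if p >= cutoff)
            for pos, residues in table.items()}


def library_size(allowed: Dict[int, List[str]]) -> int:
    """Number of sequences defined by combining the allowed residues."""
    return prod(len(residues) for residues in allowed.values())


# Invented probability table for three variable positions.
table = {
    9: {"LEU": 0.8, "ILE": 0.2},
    10: {"PHE": 0.5, "ILE": 0.4, "LEU": 0.1},
    12: {"LEU": 0.6, "ILE": 0.4},
}
loose = apply_cutoff(table, 0.05)   # low stringency: all listed residues pass
strict = apply_cutoff(table, 0.30)  # higher stringency: rarer residues are dropped
```

In this toy case the looser cutoff defines 12 sequences and the stricter one defines 4, which shows how the stringency setting trades diversity against library size.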

[0133] In a preferred embodiment, a tertiary library can be generated from combining secondary libraries. For example, a probability distribution table from a secondary library can be generated and recombined, either computationally or experimentally, as outlined herein. A PDA secondary library may be combined with a sequence alignment secondary library, and either recombined (again, computationally or experimentally) or just the cutoffs from each joined to make a new tertiary library. The top sequences from several libraries can be recombined. Primary and secondary libraries can similarly be combined. Sequences from the top of a library can be combined with sequences from the bottom of the library to more broadly sample sequence space, or only sequences distant from the top of the library can be combined. Primary and/or secondary libraries that analyzed different parts of a protein can be combined to a tertiary library that treats the combined parts of the protein. These combinations can be done to analyze large proteins, especially large multidomain proteins or complete protesomes.

[0134] In a preferred embodiment, a tertiary library can be generated using correlations in the secondary library. That is, a residue at a first variable position may be correlated to a residue at a second variable position (or correlated to residues at additional positions as well). For example, two variable positions may sterically or electrostatically interact, such that if the first residue is X, the second residue must be Y. This may be either a positive or negative correlation. This correlation, or “cluster” of residues, may be both detected and used in a variety of ways. (For the generation of correlations, see the earlier cited art).
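
One simple way such correlations might be detected is sketched below: for each pair of variable positions, the observed count of a residue pair is compared with the count expected if the two positions were independent. This is an assumed, simplified statistic and not necessarily the approach of the cited art; the function name, the zero-based position indices, and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Sequence, Tuple

PairKey = Tuple[int, str, int, str]  # (position i, residue at i, position j, residue at j)


def pair_correlations(sequences: Sequence[str],
                      positions: Sequence[int]) -> Dict[PairKey, float]:
    """Return observed/expected ratios for residue pairs seen in the library.

    A ratio above 1 suggests a positive correlation, below 1 a negative one.
    Pairs that never occur together are not listed.
    """
    n = len(sequences)
    ratios: Dict[PairKey, float] = {}
    for i, j in combinations(positions, 2):
        count_i = Counter(seq[i] for seq in sequences)
        count_j = Counter(seq[j] for seq in sequences)
        joint = Counter((seq[i], seq[j]) for seq in sequences)
        for (a, b), observed in joint.items():
            expected = count_i[a] * count_j[b] / n
            ratios[(i, a, j, b)] = observed / expected
    return ratios


# Toy library: K at index 1 always appears with E at index 3, and R with D.
library = ["AKLE", "AKLE", "ARLD", "ARLD"]
ratios = pair_correlations(library, positions=[1, 3])
```

Here the K/E and R/D pairs each occur twice as often as independence would predict, so they would be flagged as candidate clusters.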

[0135] In addition, primary and secondary libraries can be combined to form new libraries; these can be random combinations of the libraries, combining the “top” sequences, or weighting the combinations (positions or residues from the first library are scored higher than those of the second library).

[0136] As outlined herein, any number of protein attributes may be altered in these methods, including, but not limited to, enzyme activity, stability, solubility, aggregation, binding affinity, binding specificity, substrate specificity, structural integrity, immunogenicity, and toxicity. The methods may also be used to generate peptide and peptidomimetic libraries, create new antibody CDRs, generate new DNA or RNA binding properties, etc.

[0137] It should be noted that therapeutic proteins utilized in these methods will preferentially have residues in the hydrophobic cores screened, to prevent changes in the molecular surface of the protein that might induce immunogenic responses. Therapeutic proteins can also be designed in the region surrounding their binding sites to their receptors. Such a region can be defined, for example, by including in the design all residues within a certain distance, for example 4.5 Å, of the binding site residues. This range can vary from 4 to 6-10 Å. This design will serve to improve activity and specificity.
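
The distance-based definition of a design region can be sketched as below. The data layout (residue identifier mapped to a list of atom coordinates in Å), the function name, and the choice of any-atom-to-any-atom distance are assumptions made for illustration; the binding site residues themselves are included in the returned region by assumption.

```python
import math
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float, float]


def residues_near_site(atoms_by_residue: Dict[int, List[Coord]],
                       site_residues: Iterable[int],
                       cutoff: float = 4.5) -> List[int]:
    """Return residues having any atom within `cutoff` Å of any binding site atom."""
    site = set(site_residues)
    site_atoms = [atom for res in site for atom in atoms_by_residue[res]]
    selected = set(site)  # assumption: the site residues are part of the region
    for res, atoms in atoms_by_residue.items():
        if res in site:
            continue
        if any(math.dist(a, s) <= cutoff for a in atoms for s in site_atoms):
            selected.add(res)
    return sorted(selected)


# Toy coordinates with one atom per residue; residue 1 is the binding site.
toy_structure = {
    1: [(0.0, 0.0, 0.0)],
    2: [(3.0, 0.0, 0.0)],
    3: [(10.0, 0.0, 0.0)],
    4: [(0.0, 4.5, 0.0)],
}
region_default = residues_near_site(toy_structure, [1])
region_tight = residues_near_site(toy_structure, [1], cutoff=4.0)
```

Varying `cutoff` over the 4 to 10 Å range mentioned above widens or narrows the set of residues included in the design.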

[0138] In addition, a step method can be done; see Zhao et al., Nature Biotech. 16:258 (1998), hereby incorporated by reference.

[0139] In a preferred embodiment, the methods of the invention are used not on known scaffold proteins, but on random peptides, to search a virtual library for those sequences likely to adopt a stable conformation. As discussed above, there is a current benefit and focus on screening random peptide libraries to find novel binders/modulators. However, the sequences in these experimental libraries can be randomized at specific sites only, or throughout the sequence. The number of sequences that can be searched in these libraries grows exponentially with the number of positions that are randomized. Generally, only up to 10¹² to 10¹⁵ sequences can be contained in a library because of the physical constraints of laboratories (the size of the instruments, the cost of producing large numbers of biopolymers, etc.). Other practical considerations can often limit the size of the libraries to 10⁶ or fewer. These limits are reached for only 10 amino acid positions. Therefore, only a sparse sampling of sequences is possible in the search for improved proteins or peptides in experimental sequence libraries, lowering the chance of success and almost certainly missing desirable candidates. Because of the randomness of the changes in these sequences, most of the candidates in the library are not suitable, resulting in a waste of most of the effort in producing the library.
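
The arithmetic behind these limits can be checked with a short calculation, assuming all 20 natural amino acids are allowed at every randomized position. The function names are illustrative only.

```python
AMINO_ACIDS = 20  # assumption: all 20 natural amino acids at each randomized position


def library_size(n_positions: int) -> int:
    """Number of distinct sequences when n_positions are fully randomized."""
    return AMINO_ACIDS ** n_positions


def max_fully_sampled_positions(capacity: int) -> int:
    """Largest number of positions whose full library fits within `capacity`."""
    n = 0
    while library_size(n + 1) <= capacity:
        n += 1
    return n


ten_positions = library_size(10)                   # 20**10 = 10,240,000,000,000
small_lab = max_fully_sampled_positions(10**6)     # practical limit of 10^6
lower_bound = max_fully_sampled_positions(10**12)  # physical limit, low end
upper_bound = max_fully_sampled_positions(10**15)  # physical limit, high end
```

Ten fully randomized positions already give about 10¹³ sequences, which falls inside the 10¹² to 10¹⁵ range; a library capped at 10⁶ members can fully cover only 4 positions.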

[0140] However, using the automated protein design techniques outlined herein, virtual libraries of protein sequences can be generated that are vastly larger than experimental libraries. Up to 10⁷⁵ candidate sequences (or more) can be screened computationally and those that meet design criteria which favor stable and functional proteins can be readily selected. An experimental library consisting of the favorable candidates found in the virtual library screening can then be generated, resulting in a much more efficient use of the experimental library and overcoming the limitations of random protein libraries. Thus, the methods of the invention allow the virtual screening of a set of random peptides for peptides likely to take on a particular structure, thus eliminating the large number of unpreferred or unallowed conformations without having to make and test the peptides.

[0141] In addition, it is possible to randomize regions or domains of proteins as well.

[0142] Thus, in a preferred embodiment, the invention provides libraries of a completely defined set of variant scaffold proteins, wherein at least 85% of the possible members are in the library, with at least about 90% and 95% being particularly preferred. However, it is also possible that errors are introduced into the libraries experimentally, and thus the libraries contain preferably less than 25% non-defined (e.g. error) sequences; with less than 10%, less than 5% and less than 1% particularly preferred. Thus libraries that have all members as well as some error members, or some members as well as error members are included herein.

[0143] As mentioned above, two principal benefits come from the virtual library screening: (1) the automated protein design generates a list of sequence candidates that are favored to meet design criteria; it also shows which positions in the sequence are readily changed and which positions are unlikely to change without disrupting protein stability and function. An experimental random library can be generated that is only randomized at the readily changeable, non-disruptive sequence positions. (2) The diversity of amino acids at these positions can be limited to those that the automated design shows are compatible with these positions. Thus, by limiting the number of randomized positions and the number of possibilities at these positions, the number of wasted sequences produced in the experimental library is reduced, thereby increasing the probability of success in finding sequences with useful properties.

[0144] For example, the table below lists the 10 favored sequence candidates from the virtual screening of 12 positions in a protein. It shows that positions 9, 10 and 12 are most likely to have changes that do not disrupt the function of the protein, suggesting that a random experimental library that randomizes positions 9, 10 and 12 will have a higher fraction of desirable sequences. Also, the virtual library suggests that position 10 is most compatible with Ile or Phe residues, further limiting the size of the library and allowing a more complete screening of good sequences.

      1    2    3    4    5    6    7    8    9    10   11   12
 1    LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  LEU  PHE  ALA  LEU
 2    LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  LEU  ILE  ALA  LEU
 3    LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  LEU  ILE  ALA  LEU
 4    LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  LEU  PHE  ALA  ILE
 5    LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  LEU  PHE  ALA  ILE
 6    LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  LEU  ILE  ALA  ILE
 7    LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  ILE  PHE  ALA  LEU
 8    LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  LEU  ILE  ALA  ILE
 9    LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  ILE  PHE  ALA  LEU
 10   LEU  LEU  ILE  ILE  ALA  LEU  LEU  LEU  LEU  LEU  ALA  LEU
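
The reading of this table can be reproduced with a short tabulation. The sketch below re-enters the ten 12-residue candidates, counts the residues seen at each position, and reports which positions vary; the variable and function names are illustrative, and the comparison against full randomization assumes 20 amino acids per position.

```python
from collections import Counter
from typing import Dict, List

CANDIDATES = """\
LEU LEU ILE ILE ALA LEU LEU LEU LEU PHE ALA LEU
LEU LEU ILE ILE ALA LEU LEU LEU LEU ILE ALA LEU
LEU LEU ILE ILE ALA LEU LEU LEU LEU ILE ALA LEU
LEU LEU ILE ILE ALA LEU LEU LEU LEU PHE ALA ILE
LEU LEU ILE ILE ALA LEU LEU LEU LEU PHE ALA ILE
LEU LEU ILE ILE ALA LEU LEU LEU LEU ILE ALA ILE
LEU LEU ILE ILE ALA LEU LEU LEU ILE PHE ALA LEU
LEU LEU ILE ILE ALA LEU LEU LEU LEU ILE ALA ILE
LEU LEU ILE ILE ALA LEU LEU LEU ILE PHE ALA LEU
LEU LEU ILE ILE ALA LEU LEU LEU LEU LEU ALA LEU
"""


def position_counts(rows: List[List[str]]) -> Dict[int, Counter]:
    """Count residues per position (1-based) across all candidate sequences."""
    return {i + 1: Counter(column) for i, column in enumerate(zip(*rows))}


rows = [line.split() for line in CANDIDATES.strip().splitlines()]
counts = position_counts(rows)
variable_positions = [pos for pos, c in counts.items() if len(c) > 1]

# Size of a library restricted to the residues observed at the variable positions,
# compared with fully randomizing the same positions.
restricted_size = 1
for pos in variable_positions:
    restricted_size *= len(counts[pos])
full_size = 20 ** len(variable_positions)
```

The tabulation finds positions 9, 10 and 12 as the only variable ones, with Phe and Ile dominating position 10. Restricting the library to the observed residues gives 12 sequences rather than the 8,000 produced by fully randomizing three positions.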

[0145] The automated design method uses physical chemical criteria to screen sequences, resulting in sequences that are likely to be stable, structured, and that preserve function, if needed. Different design criteria can be used to produce candidate sets that are biased for properties such as charge, solubility, or active site characteristics (polarity, size), or are biased to have certain amino acids at certain positions. That is, the candidate bioactive agents and candidate nucleic acids are randomized, either fully randomized or they are biased in their randomization, e.g. in nucleotide/residue frequency generally or per position. By “randomized” or grammatical equivalents herein is meant that each nucleic acid and peptide consists of essentially random nucleotides and amino acids, respectively. Thus, any amino acid residue may be incorporated at any position. The synthetic process can be designed to generate randomized peptides and/or nucleic acids, to allow the formation of all or most of the possible combinations over the length of the nucleic acid, thus forming a library of randomized candidate nucleic acids.

[0146] In one embodiment, the library is fully randomized, with no sequence preferences or constants at any position. In a preferred embodiment, the library is biased. That is, some positions within the sequence are either held constant, or are selected from a limited number of possibilities. For example, in a preferred embodiment, the nucleotides or amino acid residues are randomized within a defined class, for example, of hydrophobic amino acids, hydrophilic residues, sterically biased (either small or large) residues, towards the creation of cysteines, for cross-linking, prolines for SH-3 domains, serines, threonines, tyrosines or histidines for phosphorylation sites, etc., or to purines, etc.

[0147] In a preferred embodiment, the bias is towards peptides or nucleic acids that interact with known classes of molecules. For example, it is known that much of intracellular signaling is carried out via short regions of polypeptides interacting with other polypeptides through small peptide domains. For instance, a short region from the HIV-1 envelope cytoplasmic domain has been previously shown to block the action of cellular calmodulin. Regions of the Fas cytoplasmic domain, which shows homology to the mastoparan toxin from wasps, can be limited to a short peptide region with death-inducing apoptotic or G protein inducing functions. Magainin, a natural peptide derived from Xenopus, can have potent anti-tumour and anti-microbial activity. Short peptide fragments of a protein kinase C isozyme (βPKC) have been shown to block nuclear translocation of βPKC in Xenopus oocytes following stimulation. And, short SH-3 target peptides have been used as pseudosubstrates for specific binding to SH-3 proteins. This is of course a short list of available peptides with biological activity, as the literature is dense in this area. Thus, there is much precedent for the potential of small peptides to have activity on intracellular signaling cascades. In addition, agonists and antagonists of any number of molecules may be used as the basis of biased randomization of candidate bioactive agents as well.

[0148] In general, the generation of prescreened random peptide libraries may be described as follows. Any structure, whether a known structure, for example a portion of a known protein, a known peptide, etc., or a synthetic structure, can be used as the backbone for PDA. For example, structures from X-ray crystallographic techniques, NMR techniques, de novo modelling, homology modelling, etc. may all be used to pick a backbone for which sequences are desired. Similarly, a number of molecules or protein domains are suitable as starting points for the generation of biased randomized candidate bioactive agents. A large number of small molecule domains are known, that confer a common function, structure or affinity. In addition, as is appreciated in the art, areas of weak amino acid homology may have strong structural homology. A number of these molecules, domains, and/or corresponding consensus sequences, are known, including, but not limited to, SH-2 domains, SH-3 domains, Pleckstrin, death domains, protease cleavage/recognition sites, enzyme inhibitors, enzyme substrates, Traf, etc. Similarly, there are a number of known nucleic acid binding proteins containing domains suitable for use in the invention. For example, leucine zipper consensus sequences are known.

[0149] Thus, in general, known peptide ligands can be used as the starting backbone for the generation of the primary library.

[0150] In addition, structures known to take on certain conformations may be used to create a backbone, and then sequences screened for those that are likely to take on that conformation. For example, there are a wide variety of “ministructures” known, sometimes referred to as “presentation structures”, that can confer conformational stability or give a random sequence a conformationally restricted form. Proteins interact with each other largely through conformationally constrained domains. Although small peptides with freely rotating amino and carboxyl termini can have potent functions as is known in the art, the conversion of such peptide structures into pharmacologic agents is difficult due to the inability to predict side-chain positions for peptidomimetic synthesis. Therefore the presentation of peptides in conformationally constrained structures will benefit both the later generation of pharmaceuticals and will also likely lead to higher affinity interactions of the peptide with the target protein. This fact has been recognized in the combinatorial library generation systems using biologically generated short peptides in bacterial phage systems. A number of workers have constructed small domain molecules in which one might present randomized peptide structures.

[0151] Thus, synthetic presentation structures, i.e. artificial polypeptides, are capable of presenting a randomized peptide as a conformationally-restricted domain. Preferred presentation structures maximize accessibility to the peptide by presenting it on an exterior loop. Accordingly, suitable presentation structures include, but are not limited to, minibody structures, loops on beta-sheet turns and coiled-coil stem structures in which residues not critical to structure are randomized, zinc-finger domains, cysteine-linked (disulfide) structures, transglutaminase linked structures, cyclic peptides, B-loop structures, helical barrels or bundles, leucine zipper motifs, etc.

[0152] In a preferred embodiment, the presentation structure is a coiled-coil structure, allowing the presentation of the randomized peptide on an exterior loop. See, for example, Myszka et al., Biochem. 33:2362-2373 (1994), hereby incorporated by reference, and FIG. 3. Using this system investigators have isolated peptides capable of high affinity interaction with the appropriate target. In general, coiled-coil structures allow for between 6 and 20 randomized positions (see Martin et al., EMBO J. 13(22):5303-5309 (1994), incorporated by reference).

[0153] In a preferred embodiment, the presentation structure is a minibody structure. A “minibody” is essentially composed of a minimal antibody complementarity region. The minibody presentation structure generally provides two randomizing regions that in the folded protein are presented along a single face of the tertiary structure. See, for example, Bianchi et al., J. Mol. Biol. 236(2):649-59 (1994), and references cited therein, all of which are incorporated by reference. Investigators have shown this minimal domain is stable in solution and have used phage selection systems in combinatorial libraries to select minibodies with peptide regions exhibiting high affinity, Kd=10⁻⁷, for the pro-inflammatory cytokine IL-6.

[0154] Once the backbone is chosen and the primary library of the random peptides generated as outlined above, the secondary library generation and creation proceeds as for the known scaffold protein, including recombination of variant positions and/or amino acid residues, either computationally or experimentally. Again, libraries of DNA expressing the protein sequences defined by the automated protein design methods can be produced. Codons can be randomized at only the nucleotide sequence triplets that define the residue positions specified by the automated design method. Also, mixtures of base triplets that code for particular amino acids could be introduced into the DNA synthesis reaction to attach a full triplet defining an amino acid in one reaction step. Also, a library of random DNA oligomers could be designed that biases the desired positions toward certain amino acids, or that restricts those positions to certain amino acids. The amino acids biased for would be those specified in the virtual screening, or a subset of those.

[0155] Multiple DNA libraries are synthesized that code for different subsets of amino acids at certain positions, allowing generation of the amino acid diversity desired without having to fully randomize the codon and thereby waste sequences in the library on stop codons, frameshifts, undesired amino acids, etc. This can be done by creating a library that at each position to be randomized is only randomized at one or two of the positions of the triplet, where the position(s) left constant are those that the amino acids to be considered at this position have in common. Multiple DNA libraries would be created to ensure that all amino acids desired at each position exist in the aggregate library. Alternatively, “shuffling”, as is generally known in the art, can be done with multiple libraries. In addition, in silico shuffling can also be done.
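
Partial codon randomization can be illustrated with the standard genetic code. In the sketch below, one codon is chosen per desired amino acid, the bases shared at each triplet position are held constant, and only the differing position(s) are mixed; the code then lists every amino acid the resulting mixed codon can encode, so that unwanted residues or stop codons can be spotted. The function names and the particular codon choices are assumptions for illustration.

```python
from itertools import product
from typing import List, Sequence

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def bases_per_position(codons: Sequence[str]) -> List[List[str]]:
    """For each of the three triplet positions, list the bases the chosen codons use."""
    return [sorted({codon[i] for codon in codons}) for i in range(3)]


def constant_positions(mix: List[List[str]]) -> List[int]:
    """Zero-based triplet positions where all chosen codons share the same base."""
    return [i for i, bases in enumerate(mix) if len(bases) == 1]


def encoded_amino_acids(mix: List[List[str]]) -> List[str]:
    """All amino acids (one-letter codes) encoded by the mixed codon."""
    return sorted({CODON_TABLE["".join(codon)] for codon in product(*mix)})


# Desired residues Phe, Leu, Ile and Val, using codons TTT, CTT, ATT and GTT.
mix = bases_per_position(["TTT", "CTT", "ATT", "GTT"])
fixed = constant_positions(mix)
encoded = encoded_amino_acids(mix)
```

For this choice only the first base varies, the second and third are held at T, and the mixed codon encodes exactly F, I, L and V with no stop codons. Sets of amino acids that share no codon positions would need to be split across separate libraries, as the paragraph above describes.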

[0156] Alternatively, the random peptide libraries may be made using the frequency tabulation and experimental generation methods including multiplexed PCR, shuffling, etc. There are a wide variety of experimental techniques that can be used to experimentally generate the libraries of the invention, including, but not limited to, Rachitt-Enchira (http://www.enchira.com/gene_shuffling.htm); error-prone PCR, for example using modified nucleotides; known mutagenesis techniques including the use of multi-cassettes; DNA shuffling (Crameri, et al., Nature 391(6664):288-291 (1998)); heterogeneous DNA samples (U.S. Pat. No. 5,939,250); ITCHY (Ostermeier, et al., Nat Biotechnol 17(12):1205-1209 (1999)); StEP (Zhao, et al., Nat Biotechnol 16(3):258-261 (1998)); GSSM (U.S. Pat. No. 6,171,820, U.S. Pat. No. 5,965,408); in vivo homologous recombination; ligase assisted gene assembly; end-complementary PCR; profusion (Roberts and Szostak, Proc Natl Acad Sci USA 94(23):12297-12302 (1997)); yeast/bacteria surface display (Lu, et al., Biotechnology (NY) 13(4):366-372 (1995); Seed and Aruffo, Proc Natl Acad Sci USA 84(10):3365-3369 (1987); Boder and Wittrup, Nat Biotechnol 15(6):553-557 (1997)).

[0157] Using the nucleic acids of the present invention which encode library members, a variety of expression vectors are made. The expression vectors may be either self-replicating extrachromosomal vectors or vectors which integrate into a host genome. Generally, these expression vectors include transcriptional and translational regulatory nucleic acid operably linked to the nucleic acid encoding the library protein. The term “control sequences” refers to DNA sequences necessary for the expression of an operably linked coding sequence in a particular host organism. The control sequences that are suitable for prokaryotes, for example, include a promoter, optionally an operator sequence, and a ribosome binding site. Eukaryotic cells are known to utilize promoters, polyadenylation signals, and enhancers.

[0158] Nucleic acid is “operably linked” when it is placed into a functional relationship with another nucleic acid sequence. For example, DNA for a presequence or secretory leader is operably linked to DNA for a polypeptide if it is expressed as a preprotein that participates in the secretion of the polypeptide; a promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the sequence; or a ribosome binding site is operably linked to a coding sequence if it is positioned so as to facilitate translation. Generally, “operably linked” means that the DNA sequences being linked are contiguous, and, in the case of a secretory leader, contiguous and in reading phase. However, enhancers do not have to be contiguous. Linking is accomplished by ligation at convenient restriction sites. If such sites do not exist, synthetic oligonucleotide adaptors or linkers are used in accordance with conventional practice. The transcriptional and translational regulatory nucleic acid will generally be appropriate to the host cell used to express the library protein, as will be appreciated by those in the art; for example, transcriptional and translational regulatory nucleic acid sequences from Bacillus are preferably used to express the library protein in Bacillus. Numerous types of appropriate expression vectors, and suitable regulatory sequences are known in the art for a variety of host cells.

[0159] In general, the transcriptional and translational regulatory sequences may include, but are not limited to, promoter sequences, ribosomal binding sites, transcriptional start and stop sequences, translational start and stop sequences, and enhancer or activator sequences. In a preferred embodiment, the regulatory sequences include a promoter and transcriptional start and stop sequences.

[0160] Promoter sequences include constitutive and inducible promoter sequences. The promoters may be either naturally occurring promoters, hybrid or synthetic promoters. Hybrid promoters, which combine elements of more than one promoter, are also known in the art, and are useful in the present invention.

[0161] In addition, the expression vector may comprise additional elements. For example, the expression vector may have two replication systems, thus allowing it to be maintained in two organisms, for example in mammalian or insect cells for expression and in a prokaryotic host for cloning and amplification. Furthermore, for integrating expression vectors, the expression vector contains at least one sequence homologous to the host cell genome, and preferably two homologous sequences which flank the expression construct. The integrating vector may be directed to a specific locus in the host cell by selecting the appropriate homologous sequence for inclusion in the vector. Constructs for integrating vectors and appropriate selection and screening protocols are well known in the art and are described in e.g., Mansour et al., Cell, 51:503 (1988) and Murray, Gene Transfer and Expression Protocols, Methods in Molecular Biology, Vol. 7 (Clifton: Humana Press, 1991).

[0162] In addition, in a preferred embodiment, the expression vector contains a selection gene to allow the selection of transformed host cells containing the expression vector, and particularly in the case of mammalian cells, ensures the stability of the vector, since cells which do not contain the vector will generally die. Selection genes are well known in the art and will vary with the host cell used. By “selection gene” herein is meant any gene which encodes a gene product that confers resistance to a selection agent. Suitable selection agents include, but are not limited to, neomycin (or its analog G418), blasticidin S, histidinol D, bleomycin, puromycin, hygromycin B, and other drugs.

[0163] In a preferred embodiment, the expression vector contains an RNA splicing sequence upstream or downstream of the gene to be expressed in order to increase the level of gene expression. See Barret et al., Nucleic Acids Res. 1991; Groos et al., Mol. Cell. Biol. 1987; and Budiman et al., Mol. Cell. Biol. 1988.

[0164] A preferred expression vector system is a retroviral vector system such as is generally described in Mann et al., Cell, 33:153-9 (1993); Pear et al., Proc. Natl. Acad. Sci. U.S.A., 90(18):8392-6 (1993); Kitamura et al., Proc. Natl. Acad. Sci. U.S.A., 92:9146-50 (1995); Kinsella et al., Human Gene Therapy, 7:1405-13; Hofmann et al., Proc. Natl. Acad. Sci. U.S.A., 93:5185-90; Choate et al., Human Gene Therapy, 7:2247 (1996); PCT/US97/01019 and PCT/US97/01048, and references cited therein, all of which are hereby expressly incorporated by reference.

[0165] The library proteins of the present invention are produced by culturing a host cell transformed with nucleic acid, preferably an expression vector, containing nucleic acid encoding a library protein, under the appropriate conditions to induce or cause expression of the library protein. The conditions appropriate for library protein expression will vary with the choice of the expression vector and the host cell, and will be easily ascertained by one skilled in the art through routine experimentation. For example, the use of constitutive promoters in the expression vector will require optimizing the growth and proliferation of the host cell, while the use of an inducible promoter requires the appropriate growth conditions for induction. In addition, in some embodiments, the timing of the harvest is important. For example, the baculoviral systems used in insect cell expression are lytic viruses, and thus harvest time selection can be crucial for product yield.

[0166] As will be appreciated by those in the art, the type of cells used in the present invention can vary widely. Basically, a wide variety of appropriate host cells can be used, including yeast, bacteria, archaebacteria, fungi, and insect and animal cells, including mammalian cells. Of particular interest are Drosophila melanogaster cells, Saccharomyces cerevisiae and other yeasts, E. coli, Bacillus subtilis, SF9 cells, C129 cells, 293 cells, Neurospora, BHK, CHO, COS, and HeLa cells, fibroblasts, Schwannoma cell lines, immortalized mammalian myeloid and lymphoid cell lines, Jurkat cells, mast cells and other endocrine and exocrine cells, and neuronal cells. See the ATCC cell line catalog, hereby expressly incorporated by reference. In addition, the expression of the secondary libraries in phage display systems, such as are well known in the art, is particularly preferred, especially when the secondary library comprises random peptides. In one embodiment, the cells may be genetically engineered, that is, contain exogenous nucleic acid, for example, to contain target molecules.

[0167] In a preferred embodiment, the library proteins are expressed in mammalian cells. Any mammalian cells may be used, with mouse, rat, primate and human cells being particularly preferred, although as will be appreciated by those in the art, modifications of the system by pseudotyping allow all eukaryotic cells to be used, preferably higher eukaryotes. As is more fully described below, a screen will be set up such that the cells exhibit a selectable phenotype in the presence of a random library member. As is more fully described below, cell types implicated in a wide variety of disease conditions are particularly useful, so long as a suitable screen may be designed to allow the selection of cells that exhibit an altered phenotype as a consequence of the presence of a library member within the cell.

[0168] Accordingly, suitable mammalian cell types include, but are not limited to, tumor cells of all types (particularly melanoma, myeloid leukemia, carcinomas of the lung, breast, ovaries, colon, kidney, prostate, pancreas and testes), cardiomyocytes, endothelial cells, epithelial cells, lymphocytes (T cell and B cell), mast cells, eosinophils, vascular intimal cells, hepatocytes, leukocytes including mononuclear leukocytes, stem cells such as haemopoietic, neural, skin, lung, kidney, liver and myocyte stem cells (for use in screening for differentiation and de-differentiation factors), osteoclasts, chondrocytes and other connective tissue cells, keratinocytes, melanocytes, liver cells, kidney cells, and adipocytes. Suitable cells also include known research cells, including, but not limited to, Jurkat T cells, NIH3T3 cells, CHO, Cos, etc. See the ATCC cell line catalog, hereby expressly incorporated by reference.

[0169] Mammalian expression systems are also known in the art, and include retroviral systems. A mammalian promoter is any DNA sequence capable of binding mammalian RNA polymerase and initiating the downstream (3′) transcription of a coding sequence for library protein into mRNA. A promoter will have a transcription initiating region, which is usually placed proximal to the 5′ end of the coding sequence, and a TATA box, usually located 25-30 base pairs upstream of the transcription initiation site. The TATA box is thought to direct RNA polymerase II to begin RNA synthesis at the correct site. A mammalian promoter will also contain an upstream promoter element (enhancer element), typically located within 100 to 200 base pairs upstream of the TATA box. An upstream promoter element determines the rate at which transcription is initiated and can act in either orientation. Of particular use as mammalian promoters are the promoters from mammalian viral genes, since the viral genes are often highly expressed and have a broad host range. Examples include the SV40 early promoter, mouse mammary tumor virus LTR promoter, adenovirus major late promoter, herpes simplex virus promoter, and the CMV promoter.

[0170] Typically, transcription termination and polyadenylation sequences recognized by mammalian cells are regulatory regions located 3′ to the translation stop codon and thus, together with the promoter elements, flank the coding sequence. The 3′ terminus of the mature mRNA is formed by site-specific post-transcriptional cleavage and polyadenylation. Examples of transcription terminator and polyadenylation signals include those derived from SV40.

[0171] The methods of introducing exogenous nucleic acid into mammalian hosts, as well as other hosts, are well known in the art, and will vary with the host cell used. Techniques include dextran-mediated transfection, calcium phosphate precipitation, polybrene mediated transfection, protoplast fusion, electroporation, viral infection, encapsulation of the polynucleotide(s) in liposomes, and direct microinjection of the DNA into nuclei.

[0172] In a preferred embodiment, library proteins are expressed in bacterial systems. Bacterial expression systems are well known in the art.

[0173] A suitable bacterial promoter is any nucleic acid sequence capable of binding bacterial RNA polymerase and initiating the downstream (3′) transcription of the coding sequence of library protein into mRNA. A bacterial promoter has a transcription initiation region which is usually placed proximal to the 5′ end of the coding sequence. This transcription initiation region typically includes an RNA polymerase binding site and a transcription initiation site. Sequences encoding metabolic pathway enzymes provide particularly useful promoter sequences. Examples include promoter sequences derived from sugar metabolizing enzymes, such as galactose, lactose and maltose, and sequences derived from biosynthetic enzymes such as tryptophan. Promoters from bacteriophage may also be used and are known in the art. In addition, synthetic promoters and hybrid promoters are also useful; for example, the tac promoter is a hybrid of the trp and lac promoter sequences. Furthermore, a bacterial promoter can include naturally occurring promoters of non-bacterial origin that have the ability to bind bacterial RNA polymerase and initiate transcription.

[0174] In addition to a functioning promoter sequence, an efficient ribosome binding site is desirable. In E. coli, the ribosome binding site is called the Shine-Dalgarno (SD) sequence and includes an initiation codon and a sequence 3-9 nucleotides in length located 3-11 nucleotides upstream of the initiation codon.
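The spacing rule for the ribosome binding site can be illustrated with a short sketch. The function name, the use of AGGAGG as the consensus, and the convention that the gap is counted between the last motif base and the first base of the initiation codon are assumptions made for this illustration only; they are not taken from the specification.

```python
def best_sd_candidate(seq, start, consensus="AGGAGG",
                      min_len=3, max_len=9, min_gap=3, max_gap=11):
    """Return (motif_start, motif, gap) for the longest consensus fragment
    found upstream of the initiation codon, or None if there is none.

    seq       : coding-strand sequence (DNA or RNA letters)
    start     : 0-based index of the first base of the initiation codon
    consensus : assumed SD consensus; only fragments of it are accepted
    gap       : number of bases between the motif and the initiation codon
    """
    seq = seq.upper().replace("U", "T")
    longest = min(max_len, len(consensus))
    best = None
    for gap in range(min_gap, max_gap + 1):
        end = start - gap                      # index just past the motif
        for length in range(min_len, longest + 1):
            begin = end - length
            if begin < 0:
                continue
            window = seq[begin:end]
            # Accept any contiguous fragment of the assumed consensus.
            if window in consensus and (best is None or length > len(best[1])):
                best = (begin, window, gap)
    return best


# Hypothetical construct: a full consensus motif 5 bases upstream of ATG.
example = "TTTAGGAGGTTTTTATGAAA"
print(best_sd_candidate(example, example.index("ATG")))
```

A real design tool would typically score partial matches against the 16S rRNA rather than require an exact fragment; the exact-match rule here is only meant to make the length and spacing windows concrete.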

[0175] The expression vector may also include a signal peptide sequence that provides for secretion of the library protein in bacteria. The signal sequence typically encodes a signal peptide comprised of hydrophobic amino acids which direct the secretion of the protein from the cell, as is well known in the art. The protein is either secreted into the growth media (gram-positive bacteria) or into the periplasmic space, located between the inner and outer membrane of the cell (gram-negative bacteria).

[0176] The bacterial expression vector may also include a selectable marker gene to allow for the selection of bacterial strains that have been transformed. Suitable selection genes include genes which render the bacteria resistant to drugs such as ampicillin, chloramphenicol, erythromycin, kanamycin, neomycin and tetracycline. Selectable markers also include biosynthetic genes, such as those in the histidine, tryptophan and leucine biosynthetic pathways.

[0177] These components are assembled into expression vectors. Expression vectors for bacteria are well known in the art, and include vectors for Bacillus subtilis, E. coli, Streptococcus cremoris, and Streptomyces lividans, among others.

[0178] The bacterial expression vectors are transformed into bacterial host cells using techniques well known in the art, such as calcium chloride treatment, electroporation, and others.

[0179] In one embodiment, library proteins are produced in insect cells. Expression vectors for the transformation of insect cells, and in particular, baculovirus-based expression vectors, are well known in the art and are described, e.g., in O'Reilly et al., Baculovirus Expression Vectors: A Laboratory Manual (New York: Oxford University Press, 1994).

[0180] In a preferred embodiment, library protein is produced in yeast cells. Yeast expression systems are well known in the art, and include expression vectors for Saccharomyces cerevisiae, Candida albicans and C. maltosa, Hansenula polymorpha, Kluyveromyces fragilis and K. lactis, Pichia guilliermondii and P. pastoris, Schizosaccharomyces pombe, and Yarrowia lipolytica. Preferred promoter sequences for expression in yeast include the inducible GAL1,10 promoter, the promoters from alcohol dehydrogenase, enolase, glucokinase, glucose-6-phosphate isomerase, glyceraldehyde-3-phosphate-dehydrogenase, hexokinase, phosphofructokinase, 3-phosphoglycerate mutase, pyruvate kinase, and the acid phosphatase gene. Yeast selectable markers include ADE2, HIS4, LEU2, TRP1, and ALG7, which confers resistance to tunicamycin; the neomycin phosphotransferase gene, which confers resistance to G418; and the CUP1 gene, which allows yeast to grow in the presence of copper ions.

[0181] The library protein may also be made as a fusion protein, using techniques well known in the art. Thus, for example, for the creation of monoclonal antibodies, if the desired epitope is small, the library protein may be fused to a carrier protein to form an immunogen. Alternatively, the library protein may be made as a fusion protein to increase expression, or for other reasons. For example, when the library protein is a library peptide, the nucleic acid encoding the peptide may be linked to other nucleic acid for expression purposes. Similarly, other fusion partners may be used, such as targeting sequences, which allow the localization of the library members into a subcellular or extracellular compartment of the cell; rescue sequences or purification tags, which allow the purification or isolation of either the library protein or the nucleic acids encoding them; stability sequences, which confer stability or protection from degradation to the library protein or the nucleic acid encoding it, for example resistance to proteolytic degradation; or combinations of these, as well as linker sequences as needed.

[0182] Thus, suitable targeting sequences include, but are not limited to, binding sequences capable of causing binding of the expression product to a predetermined molecule or class of molecules while retaining bioactivity of the expression product (for example by using enzyme inhibitor or substrate sequences to target a class of relevant enzymes); sequences signalling selective degradation, of itself or co-bound proteins; and signal sequences capable of constitutively localizing the candidate expression products to a predetermined cellular locale, including a) subcellular locations such as the Golgi, endoplasmic reticulum, nucleus, nucleoli, nuclear membrane, mitochondria, chloroplast, secretory vesicles, lysosome, and cellular membrane; and b) extracellular locations via a secretory signal. Particularly preferred is localization to either subcellular locations or to the outside of the cell via secretion.

[0183] In a preferred embodiment, the library member comprises a rescue sequence. A rescue sequence is a sequence which may be used to purify or isolate either the candidate agent or the nucleic acid encoding it. Thus, for example, peptide rescue sequences include purification sequences such as the His₆ tag for use with Ni affinity columns and epitope tags for detection, immunoprecipitation or FACS (fluorescence-activated cell sorting). Suitable epitope tags include myc (for use with the commercially available 9E10 antibody), the BSP biotinylation target sequence of the bacterial enzyme BirA, flu tags, lacZ, and GST.

[0184] Alternatively, the rescue sequence may be a unique oligonucleotide sequence which serves as a probe target site to allow the quick and easy isolation of the retroviral construct, via PCR, related techniques, or hybridization.

[0185] In a preferred embodiment, the fusion partner is a stability sequence to confer stability to the library member or the nucleic acid encoding it. Thus, for example, peptides may be stabilized by the incorporation of glycines after the initiation methionine (MG or MGG0), for protection of the peptide from ubiquitination as per Varshavsky's N-End Rule, thus conferring long half-life in the cytoplasm. Similarly, two prolines at the C-terminus render peptides largely resistant to carboxypeptidase action. The presence of two glycines prior to the prolines both imparts flexibility and prevents structure-initiating events in the di-proline from being propagated into the candidate peptide structure. Thus, preferred stability sequences are as follows: MG(X)ₙGGPP, where X is any amino acid and n is an integer of at least four.
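The preferred stability sequence format can be checked or generated mechanically. The following is a minimal sketch, assuming one-letter codes for the twenty standard amino acids; the helper names are hypothetical and are introduced only for illustration.

```python
import re

# One-letter codes for the twenty standard amino acids (assumed alphabet).
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"

# MG, then at least four residues of any kind, then GGPP.
STABILITY_PATTERN = re.compile(rf"MG[{STANDARD_AA}]{{4,}}GGPP")


def has_stability_flanks(peptide: str) -> bool:
    """True if the whole peptide has the form MG(X)nGGPP with n >= 4."""
    return STABILITY_PATTERN.fullmatch(peptide.upper()) is not None


def add_stability_flanks(core: str) -> str:
    """Wrap a core peptide of at least four residues as MG + core + GGPP."""
    core = core.upper()
    if len(core) < 4 or any(aa not in STANDARD_AA for aa in core):
        raise ValueError("core must be >= 4 standard amino acids")
    return "MG" + core + "GGPP"


print(add_stability_flanks("acde"))
print(has_stability_flanks("MGACDGGPP"))  # core of only three residues
```

Because X may itself be glycine or proline, the check is anchored on the full peptide rather than on the first GGPP it encounters.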

[0186] In one embodiment, the library nucleic acids, proteins and antibodies of the invention are labeled. By “labeled” herein is meant that nucleic acids, proteins and antibodies of the invention have at least one element, isotope or chemical compound attached to enable the detection of nucleic acids, proteins and antibodies of the invention. In general, labels fall into three classes: a) isotopic labels, which may be radioactive or heavy isotopes; b) immune labels, which may be antibodies or antigens; and c) colored or fluorescent dyes. The labels may be incorporated into the compound at any position.

[0187] In a preferred embodiment, the library protein is purified or isolated after expression. Library proteins may be isolated or purified in a variety of ways known to those skilled in the art depending on what other components are present in the sample. Standard purification methods include electrophoretic, molecular, immunological and chromatographic techniques, including ion exchange, hydrophobic, affinity, and reverse-phase HPLC chromatography, and chromatofocusing. For example, the library protein may be purified using a standard anti-library antibody column. Ultrafiltration and diafiltration techniques, in conjunction with protein concentration, are also useful. For general guidance in suitable purification techniques, see Scopes, R., Protein Purification, Springer-Verlag, NY (1982). The degree of purification necessary will vary depending on the use of the library protein. In some instances no purification will be necessary.

[0188] Once expressed and purified if necessary, the library proteins and nucleic acids are useful in a number of applications.

[0189] In general, the secondary libraries are screened for biological activity. These screens will be based on the scaffold protein chosen, as is known in the art. Thus, any number of protein activities or attributes may be tested, including its binding to its known binding members (for example, its substrates, if it is an enzyme), activity profiles, stability profiles (pH, thermal, buffer conditions), substrate specificity, immunogenicity, toxicity, etc.

[0190] When random peptides are made, these may be used in a variety of ways to screen for activity. In a preferred embodiment, a first plurality of cells is screened. That is, the cells into which the library member nucleic acids are introduced are screened for an altered phenotype. Thus, in this embodiment, the effect of the library member is seen in the same cells in which it is made; i.e. an autocrine effect.

[0191] By a “plurality of cells” herein is meant roughly from about 10³ cells to 10⁸ or 10⁹, with from 10⁶ to 10⁸ being preferred. This plurality of cells comprises a cellular library, wherein generally each cell within the library contains a member of the secondary library, i.e. a different library member, although as will be appreciated by those in the art, some cells within the library may not contain one and some may contain more than one. When methods other than retroviral infection are used to introduce the library members into a plurality of cells, the distribution of library members within the individual cell members of the cellular library may vary widely, as it is generally difficult to control the number of nucleic acids which enter a cell during electroporation, etc.
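The observation that some cells receive no library member and some receive more than one can be made quantitative under a simple model. The sketch below assumes that the number of library members per cell follows a Poisson distribution with a given mean; this model and the function name are assumptions for illustration and are not stated in the specification.

```python
import math


def occupancy_fractions(mean_per_cell: float) -> dict:
    """Expected fractions of cells with zero, one, or more than one
    library member, assuming Poisson-distributed uptake (an assumption)."""
    if mean_per_cell < 0:
        raise ValueError("mean_per_cell must be non-negative")
    p_none = math.exp(-mean_per_cell)        # P(k = 0)
    p_one = mean_per_cell * p_none           # P(k = 1)
    return {
        "none": p_none,
        "one": p_one,
        "more_than_one": 1.0 - p_none - p_one,
    }


# Compare a low and a high average number of members per cell.
for mean in (0.1, 1.0):
    print(mean, occupancy_fractions(mean))
```

Under this model, lowering the average number of members per cell reduces the fraction of cells carrying more than one member, at the cost of more empty cells.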

[0192] In a preferred embodiment, the library nucleic acids are introduced into a first plurality of cells, and the effect of the library members is screened in a second or third plurality of cells, different from the first plurality of cells, i.e. generally a different cell type. That is, the effect of the library member is due to an extracellular effect on a second cell; i.e. an endocrine or paracrine effect. This is done using standard techniques. The first plurality of cells may be grown in or on one media, and the media is allowed to touch a second plurality of cells, and the effect measured. Alternatively, there may be direct contact between the cells. Thus, “contacting” is functional contact, and includes both direct and indirect contact. In this embodiment, the first plurality of cells may or may not be screened.

[0193] If necessary, the cells are subjected to conditions suitable for the expression of the library members (for example, when inducible promoters are used), to produce the library proteins.

[0194] Thus, in one embodiment, the methods of the present invention comprise introducing a molecular library of library members into a plurality of cells, a cellular library. The plurality of cells is then screened, as is more fully outlined below, for a cell exhibiting an altered phenotype. The altered phenotype is due to the presence of a library member.

[0195] By “altered phenotype” or “changed physiology” or other grammatical equivalents herein is meant that the phenotype of the cell is altered in some way, preferably in some detectable and/or measurable way. As will be appreciated in the art, a strength of the present invention is the wide variety of cell types and potential phenotypic changes which may be tested using the present methods. Accordingly, any phenotypic change which may be observed, detected, or measured may be the basis of the screening methods herein. Suitable phenotypic changes include, but are not limited to: gross physical changes such as changes in cell morphology, cell growth, cell viability, adhesion to substrates or other cells, and cellular density; changes in the expression of one or more RNAs, proteins, lipids, hormones, cytokines, or other molecules; changes in the equilibrium state (i.e. half-life) of one or more RNAs, proteins, lipids, hormones, cytokines, or other molecules; changes in the localization of one or more RNAs, proteins, lipids, hormones, cytokines, or other molecules; changes in the bioactivity or specific activity of one or more RNAs, proteins, lipids, hormones, cytokines, receptors, or other molecules; changes in phosphorylation; changes in the secretion of ions, cytokines, hormones, growth factors, or other molecules; alterations in cellular membrane potentials, polarization, integrity or transport; changes in infectivity, susceptibility, latency, adhesion, and uptake of viruses and bacterial pathogens; etc. By “capable of altering the phenotype” herein is meant that the library member can change the phenotype of the cell in some detectable and/or measurable way.

[0196] The altered phenotype may be detected in a wide variety of ways, and will generally depend on and correspond to the phenotype that is being changed. Generally, the changed phenotype is detected using, for example: microscopic analysis of cell morphology; standard cell viability assays, including both increased cell death and increased cell viability, for example, cells that are now resistant to cell death via virus, bacteria, or bacterial or synthetic toxins; standard labeling assays such as fluorometric indicator assays for the presence or level of a particular cell or molecule, including FACS or other dye staining techniques; biochemical detection of the expression of target compounds after killing the cells; etc. In some cases, as is more fully described herein, the altered phenotype is detected in the cell in which the randomized nucleic acid was introduced; in other embodiments, the altered phenotype is detected in a second cell which is responding to some molecular signal from the first cell.

[0197] In a preferred embodiment, the library member is isolated from the positive cell. This may be done in a number of ways. In a preferred embodiment, primers complementary to DNA regions common to the constructs, or to specific components of the library such as a rescue sequence, defined above, are used to “rescue” the unique random sequence. Alternatively, the member is isolated using a rescue sequence. Thus, for example, rescue sequences comprising epitope tags or purification sequences may be used to pull out the library member, using immunoprecipitation or affinity columns. In some instances, this may also pull out things to which the library member binds (for example the primary target molecule) if there is a sufficiently strong binding interaction between the library member and the target molecule. Alternatively, the peptide may be detected using mass spectroscopy.
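Once a rescued construct has been sequenced, the unique sequence can be located computationally between the regions common to all constructs, mirroring the primer-based rescue described above. This is a minimal sketch that assumes exact, error-free flanking regions; the function name and the example flanks are hypothetical.

```python
def rescue_insert(read, left_flank, right_flank):
    """Return the sequence lying between two constant flanking regions,
    or None if either flank is missing from the read.

    Exact string matching is assumed; real reads may need tolerant
    matching to cope with sequencing errors.
    """
    read = read.upper()
    left = read.find(left_flank.upper())
    if left < 0:
        return None
    insert_start = left + len(left_flank)
    right = read.find(right_flank.upper(), insert_start)
    if right < 0:
        return None
    return read[insert_start:right]


# Hypothetical read with illustrative constant regions around the insert.
print(rescue_insert("ccGAATTCatgaaaGGATCCtt", "GAATTC", "GGATCC"))
```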

[0198] Once rescued, the sequence of the library member is determined. This information can then be used in a number of ways.

[0199] In a preferred embodiment, the member is resynthesized and reintroduced into the target cells, to verify the effect. This may be done using retroviruses, or alternatively using fusions to the HIV-1 Tat protein, and analogs and related proteins, which allows very high uptake into target cells. See for example, Fawell et al., PNAS USA 91:664 (1994); Frankel et al., Cell 55:1189 (1988); Savion et al., J. Biol. Chem. 256:1149 (1981); Derossi et al., J. Biol. Chem. 269:10444 (1994); and Baldin et al., EMBO J. 9:1511 (1990), all of which are incorporated by reference.

[0200] In a preferred embodiment, the sequence of the member is used to generate more libraries, as outlined herein.

[0201] In a preferred embodiment, the library member is used to identify target molecules, i.e. the molecules with which the member interacts. As will be appreciated by those in the art, there may be primary target molecules, to which the library member binds or acts upon directly, and there may be secondary target molecules, which are part of the signalling pathway affected by the library member; these might be termed “validated targets”.

[0202] The screening methods of the present invention may be useful to screen a large number of cell types under a wide variety of conditions. Generally, the host cells are cells that are involved in disease states, and they are tested or screened under conditions that normally result in undesirable consequences on the cells. When a suitable library member is found, the undesirable effect may be reduced or eliminated. Alternatively, normally desirable consequences may be reduced or eliminated, with an eye towards elucidating the cellular mechanisms associated with the disease state or signalling pathway.

[0203] In a preferred embodiment, the library may be put onto a chip or substrate as an array to make a “protein chip” or “biochip” to be used in high-throughput screening (HTS) techniques. Thus, the invention provides substrates with arrays comprising libraries (generally secondary or tertiary libraries) of proteins.

[0204] By “substrate” or “solid support” or other grammatical equivalents herein is meant any material that can be modified to contain discrete individual sites appropriate for the attachment or association of beads and is amenable to at least one detection method. As will be appreciated by those in the art, the number of possible substrates is very large. Possible substrates include, but are not limited to, glass and modified or functionalized glass, plastics (including acrylics, polystyrene and copolymers of styrene and other materials, polypropylene, polyethylene, polybutylene, polyurethanes, Teflon®, etc.), polysaccharides, nylon or nitrocellulose, resins, silica or silica-based materials including silicon and modified silicon, carbon, metals, inorganic glasses, plastics, optical fiber bundles, and a variety of other polymers. In general, the substrates allow optical detection and do not themselves appreciably fluoresce.

[0205] Generally the substrate is flat (planar), although as will be appreciated by those in the art, other configurations of substrates may be used as well; for example, three dimensional configurations can be used. Similarly, the arrays may be placed on the inside surface of a tube, for flow-through sample analysis to minimize sample volume.

[0206] By “array” herein is meant a plurality of library members in an array format; the size of the array will depend on the composition and end use of the array. Arrays containing from about 2 different library members to many thousands can be made. Generally, the array will comprise from 10² to 10⁸ different proteins (all numbers are per square centimeter), with from about 10³ to about 10⁶ being preferred and from about 10³ to 10⁵ being particularly preferred. In addition, in some arrays, multiple substrates may be used, either of different or identical compositions. Thus for example, large arrays may comprise a plurality of smaller substrates.
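The stated densities can be translated into feature spacing with simple arithmetic. The sketch below assumes that the library members are laid out on a regular square grid, which the specification does not require; the function names are hypothetical.

```python
import math


def spot_pitch_um(density_per_cm2: float) -> float:
    """Center-to-center spacing in micrometers for a square grid
    holding the given number of members per square centimeter."""
    if density_per_cm2 <= 0:
        raise ValueError("density must be positive")
    spots_per_cm = math.sqrt(density_per_cm2)   # spots along one 1 cm edge
    return 1e4 / spots_per_cm                   # 1 cm = 10,000 micrometers


def members_on_substrate(density_per_cm2: float, area_cm2: float) -> int:
    """Number of distinct members that fit on a substrate of given area."""
    return int(density_per_cm2 * area_cm2)


# Spacing across the general range of densities given above.
for density in (1e2, 1e4, 1e6, 1e8):
    print(f"{density:.0e} per cm^2 -> {spot_pitch_um(density):g} um pitch")
```

On this assumption, a density of 10⁴ members per square centimeter corresponds to a spacing of 100 micrometers, and the upper end of the general range, 10⁸ per square centimeter, to a spacing of 1 micrometer.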

[0207] As will be appreciated by those in the art, the library members may either be synthesized directly on the substrate, or they may be made and then attached after synthesis. In a preferred embodiment, linkers are used to attach the proteins to the substrate, to allow both good attachment, sufficient flexibility to allow good interaction with the target molecule, and to avoid undesirable binding reactions.

[0208] In a preferred embodiment, the library members are synthesized first, and then covalently or otherwise immobilized to the substrate. This may be done in a variety of ways, including known spotting techniques, ink jet techniques, etc.

[0215] By “nucleic acid” or “oligonucleotide” or grammatical equivalents herein is meant at least two nucleotides covalently linked together. A nucleic acid of the present invention will generally contain phosphodiester bonds, although in some cases, as outlined below, nucleic acid analogs are included that may have alternate backbones, comprising, for example, phosphoramide (Beaucage et al., Tetrahedron 49(10):1925 (1993) and references therein; Letsinger, J. Org. Chem. 35:3800 (1970); Sprinzl et al., Eur. J. Biochem. 81:579 (1977); Letsinger et al., Nucl. Acids Res. 14:3487 (1986); Sawai et al., Chem. Lett. 805 (1984); Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); and Pauwels et al., Chemica Scripta 26:141 (1986)), phosphorothioate (Mag et al., Nucleic Acids Res. 19:1437 (1991); and U.S. Pat. No. 5,644,048), phosphorodithioate (Briu et al., J. Am. Chem. Soc. 111:2321 (1989)), O-methylphosphoroamidite linkages (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, Oxford University Press), and peptide nucleic acid backbones and linkages (see Egholm, J. Am. Chem. Soc. 114:1895 (1992); Meier et al., Chem. Int. Ed. Engl. 31:1008 (1992); Nielsen, Nature, 365:566 (1993); Carlsson et al., Nature 380:207 (1996), all of which are incorporated by reference). Other analog nucleic acids include those with positive backbones (Dempcy et al., Proc. Natl. Acad. Sci. USA 92:6097 (1995)); non-ionic backbones (U.S. Pat. Nos. 5,386,023, 5,637,684, 5,602,240, 5,216,141 and 4,469,863; Kiedrowski et al., Angew. Chem. Intl. Ed. English 30:423 (1991); Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); Letsinger et al., Nucleoside & Nucleotide 13:1597 (1994); Chapters 2 and 3, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook; Mesmaeker et al., Bioorganic & Medicinal Chem. Lett. 4:395 (1994); Jeffs et al., J. Biomolecular NMR 34:17 (1994); Tetrahedron Lett. 37:743 (1996)) and non-ribose backbones, including those described in U.S. Pat. Nos. 5,235,033 and 5,034,506, and Chapters 6 and 7, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook. Nucleic acids containing one or more carbocyclic sugars are also included within the definition of nucleic acids (see Jenkins et al., Chem. Soc. Rev. (1995) pp. 169-176). Several nucleic acid analogs are described in Rawls, C & E News Jun. 2, 1997 page 35. All of these references are hereby expressly incorporated by reference. These modifications of the ribose-phosphate backbone may be done to facilitate the addition of ETMs, or to increase the stability and half-life of such molecules in physiological environments.

[0216] As will be appreciated by those in the art, all of these nucleic acid analogs may find use in the present invention. In addition, mixtures of naturally occurring nucleic acids and analogs can be made. Alternatively, mixtures of different nucleic acid analogs, and mixtures of naturally occurring nucleic acids and analogs may be made.

[0217] The nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded and single stranded sequence. The nucleic acid may be DNA, both genomic and cDNA, RNA or a hybrid, where the nucleic acid contains any combination of deoxyribo- and ribo-nucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, etc. A preferred embodiment utilizes isocytosine and isoguanine in nucleic acids designed to be complementary to other probes, rather than target sequences, as this reduces non-specific hybridization, as is generally described in U.S. Pat. No. 5,681,702. As used herein, the term “nucleoside” includes nucleotides as well as nucleoside and nucleotide analogs, and modified nucleosides such as amino modified nucleosides. In addition, “nucleoside” includes non-naturally occurring analog structures. Thus for example the individual units of a peptide nucleic acid, each containing a base, are referred to herein as a nucleoside.

[0218] As will be appreciated by those in the art, the proteinaceous library members may be attached to the substrate in a wide variety of ways. The functionalization of solid support surfaces such as certain polymers with chemically reactive groups such as thiols, amines, carboxyls, etc. is generally known in the art. Accordingly, substrates may be used that have surface chemistries that facilitate the attachment of the desired functionality by the user. Some examples of these surface chemistries include, but are not limited to, amino groups including aliphatic and aromatic amines, carboxylic acids, aldehydes, amides, chloromethyl groups, hydrazide, hydroxyl groups, sulfonates and sulfates.

[0219] These functional groups can be used to add any number of different libraries to the substrates, generally using known chemistries. For example, libraries containing carbohydrates may be attached to an amino-functionalized support; the aldehyde of the carbohydrate is made using standard techniques, and then the aldehyde is reacted with an amino group on the surface. In an alternative embodiment, a sulfhydryl linker may be used. There are a number of sulfhydryl reactive linkers known in the art such as SPDP, maleimides, α-haloacetyls, and pyridyl disulfides (see for example the 1994 Pierce Chemical Company catalog, technical section on cross-linkers, pages 155-200, incorporated herein by reference) which can be used to attach cysteine containing members to the support. Alternatively, an amino group on the library member may be used for attachment to an amino group on the surface. For example, a large number of stable bifunctional groups are well known in the art, including homobifunctional and heterobifunctional linkers (see Pierce Catalog and Handbook, pages 155-200). In an additional embodiment, carboxyl groups (either from the surface or from the protein) may be derivatized using well known linkers (see the Pierce catalog). For example, carbodiimides activate carboxyl groups for attack by good nucleophiles such as amines (see Torchilin et al., Critical Rev. Therapeutic Drug Carrier Systems, 7(4):275-308 (1991), expressly incorporated herein). In addition, library proteins may also be attached using other techniques known in the art, for example for the attachment of antibodies to polymers; see Slinkin et al., Bioconj. Chem. 2:342-348 (1991); Torchilin et al., supra; Trubetskoy et al., Bioconj. Chem. 3:323-327 (1992); King et al., Cancer Res. 54:6176-6185 (1994); and Wilbur et al., Bioconjugate Chem. 5:220-235 (1994), all of which are hereby expressly incorporated by reference. Similarly, when the library members are made recombinantly, the use of epitope tags (FLAG, etc.) or His₆ tags allows the attachment of the members to the surface (i.e. with antibody coated surfaces, metal (Ni) surfaces, etc.). In addition, labeling the library members with biotin or other binding partner pairs allows the use of avidin coated surfaces, etc. It should be understood that the proteins may be attached in a variety of ways, including those listed above. What is important is that the manner of attachment does not significantly alter the functionality of the protein; that is, the protein should be attached in such a flexible manner as to allow its interaction with a target.

[0220] Once the biochips are made, they may be used in any number of formats for a wide variety of purposes, as will be appreciated by those in the art. For example, the scaffold protein serving as the library starting point may be an enzyme; by putting libraries of variants on a chip, the variants can be screened for increased activity by adding substrates, or for inhibitors. Similarly, variant libraries of ligand scaffolds can be screened for increased or decreased binding affinity to the binding partner, for example a cell surface receptor. Thus, in this embodiment, for example, the extracellular portion of the receptor can be added to the array and binding affinity tested under any number of conditions; for example, binding and/or activity may be tested under different pH conditions, different buffer, salt or reagent concentrations, different temperatures, in the presence of competitive binders, etc.

[0221] Thus, in a preferred embodiment, the methods comprise differential screening to identify bioactive agents that are capable of binding to the variant proteins and/or modulating the activity of the variant proteins. “Modulation” in this context includes both an increase in activity (e.g. enzymatic activity or binding affinity) and a decrease.

[0222] Another preferred embodiment utilizes differential screening to identify drug candidates that bind to the native protein, but cannot bind to modified proteins.

[0223] Positive controls and negative controls may be used in the assays. Preferably all control and test samples are performed in at least triplicate to obtain statistically significant results. Incubation of all samples is for a time sufficient for the binding of the agent to the protein. Following incubation, all samples are washed free of non-specifically bound material and the amount of bound, generally labeled agent determined.

[0224] A variety of other reagents may be included in the screening assays. These include reagents like salts, neutral proteins (e.g. albumin), detergents, etc., which may be used to facilitate optimal protein-protein binding and/or reduce non-specific or background interactions. Also reagents that otherwise improve the efficiency of the assay, such as protease inhibitors, nuclease inhibitors, anti-microbial agents, etc., may be used. The mixture of components may be added in any order that provides for the requisite binding.

[0225] In a preferred embodiment, the activity of the variant protein is increased; in another preferred embodiment, the activity of the variant protein is decreased. Thus, bioactive agents that are antagonists are preferred in some embodiments, and bioactive agents that are agonists may be preferred in other embodiments.

[0226] Thus, in a preferred embodiment, the biochips comprising the secondary or tertiary libraries are used to screen candidate agents for binding to library members. By “candidate bioactive agent” or “candidate drugs” or grammatical equivalents herein is meant any molecule, e.g. proteins (which herein includes proteins, polypeptides, and peptides), small organic or inorganic molecules, polysaccharides, polynucleotides, etc. which are to be tested against a particular target. Candidate agents encompass numerous chemical classes. In a preferred embodiment, the candidate agents are organic molecules, particularly small organic molecules, comprising functional groups necessary for structural interaction with proteins, particularly hydrogen bonding, and typically include at least an amine, carbonyl, hydroxyl or carboxyl group, preferably at least two of the functional chemical groups. The candidate agents often comprise cyclical carbon or heterocyclic structures and/or aromatic or polyaromatic structures substituted with one or more chemical functional groups.

[0227] Candidate agents are obtained from a wide variety of sources, as will be appreciated by those in the art, including libraries of synthetic or natural compounds. As will be appreciated by those in the art, the present invention provides a rapid and easy method for screening any library of candidate agents, including the wide variety of known combinatorial chemistry-type libraries.

[0228] In a preferred embodiment, candidate agents are synthetic compounds. Any number of techniques are available for the random and directed synthesis of a wide variety of organic compounds and biomolecules, including expression of randomized oligonucleotides. See for example WO 94/24314, hereby expressly incorporated by reference, which discusses methods for generating new compounds, including random chemistry methods as well as enzymatic methods. As described in WO 94/24314, one of the advantages of the present method is that it is not necessary to characterize the candidate bioactive agents prior to the assay; only candidate agents that bind to the target need be identified. In addition, as is known in the art, coding tags using split synthesis reactions may be done, to essentially identify the chemical moieties on the beads.

[0229] Alternatively, a preferred embodiment utilizes libraries of natural compounds in the form of bacterial, fungal, plant and animal extracts that are available or readily produced, and can be attached to beads as is generally known in the art.

[0230] Additionally, natural or synthetically produced libraries and compounds are readily modified through conventional chemical, physical and biochemical means. Known pharmacological agents may be subjected to directed or random chemical modifications, including enzymatic modifications, to produce structural analogs.

[0231] In a preferred embodiment, candidate bioactive agents include proteins, nucleic acids, and chemical moieties.

[0232] In a preferred embodiment, the candidate bioactive agents are proteins. In a preferred embodiment, the candidate bioactive agents are naturally occurring proteins or fragments of naturally occurring proteins. Thus, for example, cellular extracts containing proteins, or random or directed digests of proteinaceous cellular extracts, may be attached to beads as is more fully described below. In this way libraries of procaryotic and eucaryotic proteins may be made for screening against any number of targets. Particularly preferred in this embodiment are libraries of bacterial, fungal, viral, and mammalian proteins, with the latter being preferred, and human proteins being especially preferred.

[0233] In a preferred embodiment, the candidate bioactive agents are peptides of from about 2 to about 50 amino acids, with from about 5 to about 30 amino acids being preferred, and from about 8 to about 20 being particularly preferred. The peptides may be digests of naturally occurring proteins as is outlined above, random peptides, or “biased” random peptides. By “randomized” or grammatical equivalents herein is meant that each nucleic acid and peptide consists of essentially random nucleotides and amino acids, respectively. Since generally these random peptides (or nucleic acids, discussed below) are chemically synthesized, they may incorporate any nucleotide or amino acid at any position. The synthetic process can be designed to generate randomized proteins or nucleic acids, to allow the formation of all or most of the possible combinations over the length of the sequence, thus forming a library of randomized candidate bioactive proteinaceous agents. In addition, the candidate agents may themselves be the product of the invention; that is, a library of proteinaceous candidate agents may be made using the methods of the invention.

[0234] The library should provide a sufficiently structurally diverse population of randomized agents to effect a probabilistically sufficient range of diversity to allow binding to a particular target. Accordingly, an interaction library must be large enough so that at least one of its members will have a structure that gives it affinity for the target. Although it is difficult to gauge the required absolute size of an interaction library, nature provides a hint with the immune response: a diversity of 10⁷-10⁸ different antibodies provides at least one combination with sufficient affinity to interact with most potential antigens faced by an organism. Published in vitro selection techniques have also shown that a library size of 10⁷-10⁸ is sufficient to find structures with affinity for the target. A library of all combinations of a peptide 7 to 20 amino acids in length, such as generally proposed herein, has the potential to code for 20⁷ (approximately 10⁹) to 20²⁰ sequences. Thus, with libraries of 10⁷-10⁸ different molecules the present methods allow a “working” subset of a theoretically complete interaction library for 7 amino acids, and a subset of shapes for the 20²⁰ library. Thus, in a preferred embodiment, at least 10⁶, preferably at least 10⁷, more preferably at least 10⁸ and most preferably at least 10⁹ different sequences are simultaneously analyzed in the subject methods. Preferred methods maximize library size and diversity.
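By way of illustration only, the library-size arithmetic in the preceding paragraph can be checked with a short calculation. The function names below are hypothetical and are not part of the described methods; the sketch simply assumes 20 possible amino acids at every position.

```python
import math


def peptide_library_size(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of a fully randomized peptide."""
    return alphabet_size ** length


def sampled_fraction(n_molecules: int, length: int) -> float:
    """Upper bound on the fraction of the theoretical library that
    n_molecules different molecules can cover."""
    return min(1.0, n_molecules / peptide_library_size(length))


for n in (7, 20):
    size = peptide_library_size(n)
    print(f"length {n}: 20^{n} = {size:.3e} (about 10^{math.log10(size):.1f})")

# A library of 10^8 molecules covers at most this share of all 7-mers.
print(f"{sampled_fraction(10**8, 7):.1%}")
```

Under these assumptions a 10⁸-member library covers only a fraction of the complete 7-mer space, which is consistent with the description of such a library as a “working” subset.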

[0235] Thus, in a preferred embodiment, the invention provides biochips comprising libraries of variant proteins, with the library comprising at least about 100 different variants, with at least about 500 different variants being preferred, about 1000 different variants being particularly preferred and about 5000-10,000 being especially preferred.

[0236] In one embodiment, the candidate library is fully randomized, with no sequence preferences or constants at any position. In a preferred embodiment, the candidate library is biased. That is, some positions within the sequence are either held constant, or are selected from a limited number of possibilities. For example, in a preferred embodiment, the nucleotides or amino acid residues are randomized within a defined class, for example, of hydrophobic amino acids, hydrophilic residues, sterically biased (either small or large) residues, towards the creation of cysteines, for cross-linking, prolines for SH-3 domains, serines, threonines, tyrosines or histidines for phosphorylation sites, etc., or to purines, etc.

[0237] In a preferred embodiment, the bias is towards peptides or nucleic acids that interact with known classes of molecules. For example, when the candidate bioactive agent is a peptide, it is known that much of intracellular signaling is carried out via short regions of polypeptides interacting with other polypeptides through small peptide domains. For instance, a short region from the HIV-1 envelope cytoplasmic domain has been previously shown to block the action of cellular calmodulin. Regions of the Fas cytoplasmic domain, which shows homology to the mastoparan toxin from wasps, can be limited to a short peptide region with death-inducing apoptotic or G protein inducing functions. Magainin, a natural peptide derived from Xenopus, can have potent anti-tumour and anti-microbial activity. Short peptide fragments of a protein kinase C isozyme (βPKC) have been shown to block nuclear translocation of βPKC in Xenopus oocytes following stimulation. And, short SH-3 target peptides have been used as pseudosubstrates for specific binding to SH-3 proteins. This is of course a short list of available peptides with biological activity, as the literature is dense in this area. Thus, there is much precedent for the potential of small peptides to have activity on intracellular signaling cascades. In addition, agonists and antagonists of any number of molecules may be used as the basis of biased randomization of candidate bioactive agents as well.

[0238] Thus, a number of molecules or protein domains are suitable as starting points for the generation of biased randomized candidate bioactive agents. A large number of small molecule domains are known that confer a common function, structure or affinity. In addition, as is appreciated in the art, areas of weak amino acid homology may have strong structural homology. A number of these molecules, domains, and/or corresponding consensus sequences are known, including, but not limited to, SH-2 domains, SH-3 domains, Pleckstrin, death domains, protease cleavage/recognition sites, enzyme inhibitors, enzyme substrates, Traf, etc. Similarly, there are a number of known nucleic acid binding proteins containing domains suitable for use in the invention. For example, leucine zipper consensus sequences are known.

[0239] In a preferred embodiment, the candidate bioactive agents are nucleic acids. By “nucleic acid” or “oligonucleotide” or grammatical equivalents herein is meant at least two nucleotides covalently linked together. A nucleic acid of the present invention will generally contain phosphodiester bonds, although in some cases, as outlined below, nucleic acid analogs are included that may have alternate backbones, comprising, for example, phosphoramide (Beaucage et al., Tetrahedron 49(10):1925 (1993) and references therein; Letsinger, J. Org. Chem. 35:3800 (1970); Sprinzl et al., Eur. J. Biochem. 81:579 (1977); Letsinger et al., Nucl. Acids Res. 14:3487 (1986); Sawai et al., Chem. Lett. 805 (1984); Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); and Pauwels et al., Chemica Scripta 26:141 (1986)), phosphorothioate (Mag et al., Nucleic Acids Res. 19:1437 (1991); and U.S. Pat. No. 5,644,048), phosphorodithioate (Briu et al., J. Am. Chem. Soc. 111:2321 (1989)), O-methylphosphoroamidite linkages (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, Oxford University Press), and peptide nucleic acid backbones and linkages (see Egholm, J. Am. Chem. Soc. 114:1895 (1992); Meier et al., Chem. Int. Ed. Engl. 31:1008 (1992); Nielsen, Nature, 365:566 (1993); Carlsson et al., Nature 380:207 (1996), all of which are incorporated by reference). Other analog nucleic acids include those with positive backbones (Dempcy et al., Proc. Natl. Acad. Sci. USA 92:6097 (1995)); non-ionic backbones (U.S. Pat. Nos. 5,386,023, 5,637,684, 5,602,240, 5,216,141 and 4,469,863; Kiedrowski et al., Angew. Chem. Intl. Ed. English 30:423 (1991); Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); Letsinger et al., Nucleoside & Nucleotide 13:1597 (1994); Chapters 2 and 3, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook; Mesmaeker et al., Bioorganic & Medicinal Chem. Lett. 4:395 (1994); Jeffs et al., J. Biomolecular NMR 34:17 (1994); Tetrahedron Lett. 37:743 (1996)) and non-ribose backbones, including those described in U.S. Pat. Nos. 5,235,033 and 5,034,506, and Chapters 6 and 7, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook. Nucleic acids containing one or more carbocyclic sugars are also included within the definition of nucleic acids (see Jenkins et al., Chem. Soc. Rev. (1995) pp 169-176). Several nucleic acid analogs are described in Rawls, C & E News Jun. 2, 1997 page 35. All of these references are hereby expressly incorporated by reference. These modifications of the ribose-phosphate backbone may be done to facilitate the addition of additional moieties such as labels, or to increase the stability and half-life of such molecules in physiological environments.

[0240] As will be appreciated by those in the art, all of these nucleic acid analogs may find use in the present invention. In addition, mixtures of naturally occurring nucleic acids and analogs can be made. Alternatively, mixtures of different nucleic acid analogs, and mixtures of naturally occurring nucleic acids and analogs may be made.

[0241] The nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded or single stranded sequence. The nucleic acid may be DNA, both genomic and cDNA, RNA or a hybrid, where the nucleic acid contains any combination of deoxyribo- and ribonucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, etc. As used herein, the term “nucleoside” includes nucleotides and nucleoside and nucleotide analogs, and modified nucleosides such as amino modified nucleosides. In addition, “nucleoside” includes non-naturally occurring analog structures. Thus for example the individual units of a peptide nucleic acid, each containing a base, are referred to herein as a nucleoside.

[0242] As described above generally for proteins, nucleic acid candidate bioactive agents may be naturally occurring nucleic acids, random nucleic acids, or “biased” random nucleic acids. For example, digests of procaryotic or eucaryotic genomes may be used as is outlined above for proteins. Where the ultimate expression product is a nucleic acid, at least 10, preferably at least 12, more preferably at least 15, most preferably at least 21 nucleotide positions need to be randomized, with more being preferable if the randomization is less than perfect. Similarly, at least 5, preferably at least 6, more preferably at least 7 amino acid positions need to be randomized; again, more are preferable if the randomization is less than perfect.

[0243] In a preferred embodiment, the candidate bioactive agents are organic moieties. In this embodiment, as is generally described in WO 94/24314, candidate agents are synthesized from a series of substrates that can be chemically modified. “Chemically modified” herein includes traditional chemical reactions as well as enzymatic reactions. These substrates generally include, but are not limited to, alkyl groups (including alkanes, alkenes, alkynes and heteroalkyl), aryl groups (including arenes and heteroaryl), alcohols, ethers, amines, aldehydes, ketones, acids, esters, amides, cyclic compounds, heterocyclic compounds (including purines, pyrimidines, benzodiazepines, beta-lactams, tetracyclines, cephalosporins, and carbohydrates), steroids (including estrogens, androgens, cortisone, ecdysone, etc.), alkaloids (including ergots, vinca, curare, pyrrolizidine, and mitomycins), organometallic compounds, hetero-atom bearing compounds, amino acids, and nucleosides. Chemical (including enzymatic) reactions may be done on the moieties to form new substrates or candidate agents which can then be tested using the present invention.

[0244] As will be appreciated by those in the art, it is possible to screen more than one type of candidate agent at a time. Thus, the library of candidate agents used in any particular assay may include only one type of agent (e.g. peptides), or multiple types (e.g. peptides and organic agents).

[0245] Thus, in a preferred embodiment, the invention provides biochips comprising variant libraries of at least one scaffold protein, and methods of screening utilizing the biochips. Thus, for example, the invention provides completely defined libraries of variant scaffold proteins having a defined set number, wherein at least 85-90-95% of the possible members are present in the library.

[0246] In addition, as will also be appreciated by those in the art, the biochips of the invention may be part of an HTS system utilizing any number of components. Fully robotic or microfluidic systems include automated liquid-, particle-, cell- and organism-handling including high throughput pipetting to perform all steps of gene targeting and recombination applications. This includes liquid, particle, cell, and organism manipulations such as aspiration, dispensing, mixing, diluting, washing, accurate volumetric transfers; retrieving and discarding of pipette tips; and repetitive pipetting of identical volumes for multiple deliveries from a single sample aspiration. These manipulations are cross-contamination-free liquid, particle, cell, and organism transfers. This instrument performs automated replication of microplate samples to filters, membranes, and/or daughter plates, high-density transfers, full-plate serial dilutions, and high capacity operation.

[0247] The system used can include a computer workstation comprising a microprocessor programmed to manipulate a device selected from the group consisting of a thermocycler, a multichannel pipettor, a sample handler, a plate handler, a gel loading system, an automated transformation system, a gene sequencer, a colony picker, a bead picker, a cell sorter, an incubator, a light microscope, a fluorescence microscope, a spectrofluorimeter, a spectrophotometer, a luminometer, a CCD camera and combinations thereof.

[0248] In a preferred embodiment, the methods of the invention are used to generate variant libraries to facilitate and correlate single nucleotide polymorphism (SNP) analysis. That is, by drawing on known SNP data and determining the effect of the SNP on the protein, information concerning SNP analysis can be determined. Thus, for example, making a “sequence alignment” of sorts using known SNPs can result in a probability distribution table that can be used to design all possible SNP variants, which can then be put on a biochip and tested for activity and effect.

[0249] The following examples serve to more fully describe the manner of using the above-described invention, as well as to set forth the best modes contemplated for carrying out various aspects of the invention. It is understood that these examples in no way serve to limit the true scope of this invention, but rather are presented for illustrative purposes. All references cited herein are incorporated by reference.

EXAMPLES

Example 1

Computational Prescreening on β-lactamase TEM-1

[0250] Preliminary experiments were performed on the β-lactamase gene TEM-1. Brookhaven Protein Data Bank entry 1BTL was used as the starting structure. All water molecules and the SO₄²⁻ group were removed and explicit hydrogens were generated on the structure. The structure was then minimized for 50 steps without electrostatics using the conjugate gradient method and the Dreiding II force field. These steps were performed using the BIOGRAF program (Molecular Simulations, Inc., San Diego, Calif.). This minimized structure served as the template for all the protein design calculations.

Computational Pre-screening

[0251] Computational pre-screening of sequences was performed using PDA. A 4 Å sphere was drawn around the heavy side chain atoms of the four catalytic residues (S70, K73, S130, and E166) and all amino acids having heavy side chain atoms within this distance cutoff were selected. This yielded the following 7 positions: F72, Y105, N132, N136, L169, N170, and K234. Two of these residues, N132 and K234, are highly conserved across several different β-lactamases and were therefore not included in the design, leaving five variable positions (F72, Y105, N136, L169, N170). These designed positions were allowed to change their identity to any of the 20 naturally occurring amino acids except proline, cysteine, and glycine (a total of 17 amino acids). Proline is usually not allowed since it is difficult to define appropriate rotamers for proline, cysteine is excluded to prevent formation of disulfide bonds, and glycine is excluded because of conformational flexibility.
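By way of illustration only, the distance-cutoff selection described above can be sketched as follows. The coordinates and residue labels are invented for the example and are not taken from entry 1BTL; the function name is hypothetical and this is not the PDA implementation.

```python
import math


def residues_within_cutoff(catalytic_atoms, candidate_atoms, cutoff=4.0):
    """Return residues having at least one heavy side chain atom within
    `cutoff` angstroms of any heavy side chain atom of a catalytic residue.

    catalytic_atoms: list of (x, y, z) tuples
    candidate_atoms: dict mapping residue label -> list of (x, y, z) tuples
    """
    selected = []
    for residue, atoms in candidate_atoms.items():
        if any(math.dist(a, c) <= cutoff for a in atoms for c in catalytic_atoms):
            selected.append(residue)
    return selected


# Hypothetical coordinates (in angstroms), for illustration only.
catalytic_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
candidate_atoms = {
    "RES_A": [(3.0, 2.0, 0.0), (9.0, 9.0, 9.0)],  # one atom 2.5 A away
    "RES_B": [(6.0, 0.0, 0.0)],                    # nearest atom 4.5 A away
    "RES_C": [(0.0, 4.0, 0.0)],                    # exactly 4.0 A away
}
selected = residues_within_cutoff(catalytic_atoms, candidate_atoms)
print(selected)
```

The same function, called with a 5 Å cutoff around the designed positions, would correspond to the selection of floated residues in the following paragraph.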

[0252] Additionally, a second set of residues within 5 Å of the residues selected for PDA design were floated (their amino acid identity was retained as wild type, but their conformation was allowed to change). The heavy side chain atoms were again used to determine which residues were within the cutoff. This yielded the following 28 positions: M68, M69, S70, T71, K73, V74, L76, V103, E104, S106, P107, I127, M129, S130, A135, L139, L148, L162, R164, W165, E166, P167, D179, M211, D214, V216, S235, I247. A248 was included as a floated position instead of I247. The two prolines, P107 and P167, were excluded from the floated residues, as were positions M69, R164, and W165, since their crystal structures exhibit highly strained rotamers, leaving 23 floated residues from the second set. The conserved residues N132 and K234 from the first sphere (4 Å) were also floated, resulting in a total of 25 floated residues.
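By way of illustration only, the bookkeeping of the floated positions can be reproduced with simple list operations. The residue labels are those given in the text; the variable names are hypothetical.

```python
# The 28 positions found within 5 angstroms of the designed residues.
second_sphere = [
    "M68", "M69", "S70", "T71", "K73", "V74", "L76", "V103", "E104", "S106",
    "P107", "I127", "M129", "S130", "A135", "L139", "L148", "L162", "R164",
    "W165", "E166", "P167", "D179", "M211", "D214", "V216", "S235", "I247",
]

# A248 is floated instead of I247.
second_sphere = ["A248" if r == "I247" else r for r in second_sphere]

# Prolines and positions with highly strained crystal-structure rotamers.
excluded = {"P107", "P167", "M69", "R164", "W165"}

# Conserved first-sphere residues that are floated rather than designed.
conserved = ["N132", "K234"]

floated = [r for r in second_sphere if r not in excluded] + conserved
print(len(floated))
```

The count of 25 agrees with the number of floated positions listed in Table 3.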

[0253] The potential functions and parameters used in the PDA calculations were as follows. The van der Waals scale factor was set to 0.9, and the electrostatic potential was calculated using a distance attenuation and a dielectric constant of 40. The well depth for the hydrogen bond potential was set to 8 kcal/mol with a local and remote backbone scale factor of 0.25 and 1.0 respectively. The solvation potential was only calculated for designed positions classified as core (F72, L169, M68, T71, V74, L76, I127, A135, L139, L148, L162, M211 and A248). Type 2 solvation was used (Street and Mayo, 1998). The non-polar exposure multiplication factor was set to 1.6, the non-polar burial energy was set to 0.048 kcal/mol/Å², and the polar hydrogen burial energy was set to 2.0 kcal/mol.

[0254] The Dead End Elimination (DEE) optimization method (see reference) was used to find the lowest energy, ground state sequence. DEE cutoffs of 50 and 100 kcal/mol were used for singles and doubles energy calculations, respectively.
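By way of illustration only, one pass of the original DEE singles criterion may be sketched as follows. This is an assumption about the general technique and not the PDA implementation: a rotamer r at position i is eliminated if its best-case energy exceeds the worst-case energy of some alternative rotamer t at the same position. The energy values and the function name are hypothetical, and the 50 and 100 kcal/mol cutoffs mentioned above are not modeled.

```python
def dee_singles_pass(self_energy, pair_energy):
    """One pass of the original DEE singles criterion.

    self_energy[i][r]       : energy of rotamer r at position i
    pair_energy[(i, j)][r][s]: interaction of rotamer r at i with rotamer s at j
    Returns a dict mapping each position to the set of eliminated rotamers.
    """
    n_pos = len(self_energy)
    eliminated = {i: set() for i in range(n_pos)}
    for i in range(n_pos):
        best, worst = [], []
        for r, e_self in enumerate(self_energy[i]):
            rows = [pair_energy[(i, j)][r] for j in range(n_pos) if j != i]
            best.append(e_self + sum(min(row) for row in rows))   # best case for r
            worst.append(e_self + sum(max(row) for row in rows))  # worst case for r
        for r in range(len(best)):
            # r is a dead end if some other rotamer t is always better.
            if any(best[r] > worst[t] for t in range(len(worst)) if t != r):
                eliminated[i].add(r)
    return eliminated


# Hypothetical energies (kcal/mol) for two positions with two rotamers each.
self_energy = [[0.0, 5.0], [1.0, 0.0]]
pair_energy = {
    (0, 1): [[-1.0, 0.5], [2.0, 3.0]],
    (1, 0): [[-1.0, 2.0], [0.5, 3.0]],
}
eliminated = dee_singles_pass(self_energy, pair_energy)
print(eliminated)
```

In this toy case only the second rotamer at the first position is eliminated; in practice the pass is repeated, together with doubles criteria, until no further rotamers can be removed.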

[0255] Starting from the DEE ground state sequence, a Monte Carlo (MC) calculation was performed that generated a list of the 1000 lowest energy sequences. The MC parameters were 100 annealing cycles with 1,000,000 steps per cycle. The non-productive cycle limit was set to 50. In the annealing schedule, the high and low temperatures were set to 5000 and 100 K respectively.
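By way of illustration only, a Monte Carlo annealing search that starts from a ground state sequence and keeps a list of the lowest energy sequences visited may be sketched as follows. The toy energy function, the linear cooling schedule within each cycle, the reduced cycle and step counts, and all names are assumptions made for the example; the non-productive cycle limit is not modeled, and this is not the PDA implementation.

```python
import math
import random

K_B = 0.0019872  # gas constant in kcal/(mol*K)


def anneal_sequences(start, energy, alphabet, cycles=3, steps=500,
                     t_high=5000.0, t_low=100.0, keep=10, seed=0):
    """Metropolis Monte Carlo with repeated annealing cycles.

    Returns up to `keep` (sequence, energy) pairs, lowest energy first.
    """
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current)
    visited = {tuple(current): e_current}
    for _ in range(cycles):
        for step in range(steps):
            # Assumed schedule: linear cooling from t_high to t_low per cycle.
            temperature = t_high + (t_low - t_high) * step / (steps - 1)
            trial = current.copy()
            trial[rng.randrange(len(trial))] = rng.choice(alphabet)
            e_trial = energy(trial)
            delta = e_trial - e_current
            # Metropolis criterion: always accept downhill moves, accept
            # uphill moves with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / (K_B * temperature)):
                current, e_current = trial, e_trial
                visited[tuple(current)] = e_current
    return sorted(visited.items(), key=lambda item: item[1])[:keep]


# 17 allowed amino acids: the 20 natural ones minus proline, cysteine, glycine.
ALPHABET = "ADEFHIKLMNQRSTVWY"
REFERENCE = "FYNLN"  # toy ground state, used only to define the toy energy


def toy_energy(seq):
    """Hypothetical energy: one unit per mismatch with REFERENCE."""
    return float(sum(a != b for a, b in zip(seq, REFERENCE)))


lowest = anneal_sequences(REFERENCE, toy_energy, ALPHABET)
for seq, e in lowest[:3]:
    print("".join(seq), e)
```

Because the search starts from the ground state, that sequence always heads the returned list; the remaining entries are the near-optimal sequences from which a probability distribution can be compiled.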

[0256] The following probability distribution was then calculated from the top 1000 sequences in the MC list (see Table 3 below). It shows the number of occurrences of each of the amino acids selected for each position (the 5 variable positions and the 25 floated positions).

TABLE 3
Monte Carlo analysis (amino acids and their number of occurrences for the top 1000 sequences)

Position   Amino acid occurrences
 68        M: 1000
 70        S: 1000
 71        T: 1000
 72        Y: 591, F: 365, V: 35, E: 8, L: 1
 73        K: 1000
 74        V: 1000
 76        L: 1000
103        V: 1000
104        E: 1000
105        M: 183, Q: 142, I: 132, N: 129, E: 126, S: 115, D: 97, A: 76
106        S: 1000
127        I: 1000
129        M: 1000
130        S: 1000
132        N: 1000
135        A: 1000
136        D: 530, M: 135, N: 97, V: 68, E: 66, S: 38, T: 33, A: 27, Q: 6
139        L: 1000
148        L: 1000
162        L: 1000
166        E: 1000
169        L: 689, E: 156, M: 64, S: 37, D: 23, A: 21, Q: 10
170        M: 249, L: 118, E: 113, D: 112, T: 90, Q: 87, S: 66, R: 44, A: 35, N: 24, F: 21, K: 15, Y: 9, H: 9, V: 8
179        D: 1000
211        M: 1000
214        D: 1000
216        V: 1000
234        K: 1000
235        S: 1000
248        A: 1000
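By way of illustration only, a table of per-position occurrences of the kind shown in Table 3 can be compiled from a list of sequences as sketched below. The five input sequences are invented for the example and are not the MC output; the function name is hypothetical.

```python
from collections import Counter


def position_counts(sequences, positions):
    """Count how often each amino acid occurs at each position.

    sequences: iterable of strings, one letter per listed position
    positions: residue numbers, in the same order as the letters
    """
    counts = {p: Counter() for p in positions}
    for seq in sequences:
        for p, aa in zip(positions, seq):
            counts[p][aa] += 1
    return counts


positions = (72, 105, 136, 169, 170)
# Hypothetical low-energy sequences, for illustration only.
sequences = ["YMDLM", "YQDLL", "FMDLM", "YIMLE", "FNDEM"]

counts = position_counts(sequences, positions)
for p in positions:
    print(p, ", ".join(f"{aa}: {n}" for aa, n in counts[p].most_common()))
```

Dividing each count by the number of sequences gives the corresponding probability distribution.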

[0257] This probability distribution was then transformed into a rounded probability distribution (see Table 4). A 10% cutoff value was used to round at the designed positions and the wild type amino acids were forced to occur with a probability of at least 10%. An E was found at position 169 15.6% of the time. However, since this position is adjacent to another designed position, 170, its closeness would have required a more complicated oligonucleotide library design; E was therefore not included for this position when generating the sequence library (only L was used).

TABLE 4
PDA probability distribution for the designed positions of β-lactamase (rounded to the nearest 10%)

Position   Amino acid (probability)
 72        Y 50%, F 50%
105        M 20%, Q 20%, I 20%, N 10%, E 10%, S 10%, Y 10%
136        D 70%, M 20%, N 10%
169        L 100%
170        M 30%, L 20%, E 20%, D 20%, N 10%

[0258] As seen from Table 4, the computational pre-screening resulted in an enormous reduction in the size of the problem. Originally, 17 different amino acids were allowed at each of the 5 designed positions, giving 17⁵ = 1,419,857 possible sequences. This was pared down to just 2 × 7 × 3 × 1 × 5 = 210 possible sequences, a reduction of nearly four orders of magnitude.
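By way of illustration only, the reduction in library size can be verified with a short calculation. The number of choices per position is read from Table 4; the variable names are hypothetical.

```python
import math

allowed_amino_acids = 17
designed_positions = 5
full_library = allowed_amino_acids ** designed_positions

# Number of amino acids retained at each designed position (Table 4).
choices = {72: 2, 105: 7, 136: 3, 169: 1, 170: 5}
reduced_library = math.prod(choices.values())

orders_of_magnitude = math.log10(full_library / reduced_library)
print(full_library, reduced_library, round(orders_of_magnitude, 2))
```

The ratio is roughly 6,800-fold, or about 3.8 orders of magnitude.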

Generation of Sequence Library

[0259] Overlapping oligonucleotides corresponding to the full length TEM-1 gene for β-lactamase and all desired mutations were synthesized and used in a PCR reaction as described previously (FIG. 1), resulting in a sequence library containing the 210 sequences described above.

Synthesis of Mutant TEM-1 Genes

[0260] To allow the mutation of the TEM-1 gene, pCR2.1 (Invitrogen) was digested with XbaI and EcoRI, blunt ended with T4 DNA polymerase, and religated. This removes the HindIII and XhoI sites within the polylinker. A new XhoI site was then introduced into the TEM-1 gene at position 2269 (numbering as of the original pCR2.1) using a QuikChange Site-Directed Mutagenesis Kit as described by the manufacturer (Stratagene). Similarly, a new HindIII site was introduced at position 2674 to give pCR-Xen1.

[0261] To construct the mutated TEM-1 genes, overlapping 40-mer oligonucleotides were synthesized corresponding to the sequence between the newly introduced XhoI and HindIII sites, designed to allow a 20 nucleotide overlap with adjacent oligonucleotides. At each of the designed positions (72, 105, 136 and 170) multiple oligonucleotides were synthesized, each containing a different mutation so that all the possible combinations of mutant sequences (210) could be made in the desired proportions as shown in Table 4. For example, at position 72, two sets of oligonucleotides were synthesized, one containing an F at position 72, the other containing a Y. Each oligonucleotide was resuspended at a concentration of 1 μg/μl, and equal molar concentrations of the oligonucleotides were pooled.

[0262] At the redundant positions, each oligonucleotide was added at a concentration that reflected the probabilities in Table 4. For example, at position 72 equal amounts of the two oligonucleotides were added to the pool, while at position 136, twice as much M-containing oligonucleotide was added compared to the N-containing oligonucleotide, and seven times as much D-containing oligonucleotide was added compared to the N-containing oligonucleotide.
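By way of illustration only, the relative amounts of each oligonucleotide to pool at a position can be derived from the rounded probabilities by dividing by the smallest probability at that position. The function name is hypothetical; the percentages are those of Table 4.

```python
from fractions import Fraction


def relative_amounts(percentages):
    """Express each amino acid's share relative to the least frequent one."""
    smallest = min(percentages.values())
    return {aa: Fraction(pct, smallest) for aa, pct in percentages.items()}


table4 = {
    72: {"Y": 50, "F": 50},
    136: {"D": 70, "M": 20, "N": 10},
}

for position, percentages in table4.items():
    ratios = relative_amounts(percentages)
    print(position, {aa: str(ratio) for aa, ratio in ratios.items()})
```

For position 136 this gives a D:M:N ratio of 7:2:1, and for position 72 a Y:F ratio of 1:1, matching the proportions described above.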

DNA Library Assembly

[0263] For the first round of PCR, 2 μl of pooled oligonucleotides at the desired probabilities (Table 4) were added to a 100 μl reaction that contained 2 μl 10 mM dNTPs, 10 μl 10× Taq buffer (Qiagen), 1 μl of Taq DNA polymerase (5 units/μl: Qiagen) and 2 μl Pfu DNA polymerase (2.5 units/μl: Promega). The reaction mixture was assembled on ice and subjected to 94° C. for 5 minutes, 15 cycles of 94° C. for 30 seconds, 52° C. for 30 seconds and 72° C. for 30 seconds, and a final extension step of 72° C. for 10 minutes.

Isolation of Full Length Oligonucleotides

[0264] For the second round of PCR, 2.5 μl of the first round reaction was added to a 100 μl reaction containing 2 μl 10 mM dNTPs, 10 μl of 10× Pfu DNA polymerase buffer (Promega), 2 μl Pfu DNA polymerase (2.5 units/μl: Promega), and 1 μg of oligonucleotides corresponding to the 5′ and 3′ ends of the synthesized gene. The reaction mixture was assembled on ice and subjected to 94° C. for 5 minutes, 20 cycles of 94° C. for 30 seconds, 52° C. for 30 seconds and 72° C. for 30 seconds, and a final extension step of 72° C. for 10 minutes to isolate the full length oligonucleotides.

Purification of DNA Library

[0265] The PCR products were purified using a QIAquick PCR Purification Kit (Qiagen), digested with XhoI and HindIII, electrophoresed through a 1.2% agarose gel and re-purified using a QIAquick Gel Extraction Kit (Qiagen).

Verification of Sequence Library Identity

[0266] The PCR products containing the library of mutant TEM-1 β-lactamase genes were then cloned between a promoter and terminator in a kanamycin resistant plasmid and transformed into E. coli. An equal number of bacteria were then spread onto media containing either kanamycin or ampicillin. All transformed colonies will be resistant to kanamycin, but only those with active mutated β-lactamase genes will grow on ampicillin.

[0267] After overnight incubation, several colonies were observed on both plates, indicating that at least one of the above sequences encodes an active β-lactamase. The number of colonies on the kanamycin plate far outnumbered those on the ampicillin plate (roughly a 5:1 ratio), suggesting that either some of the sequences destroy activity, or that the PCR introduces errors that yield an inactive or truncated enzyme.

[0268] To distinguish between these possibilities, 60 colonies were picked from the kanamycin plate and their plasmid DNA was sequenced. This gave the distribution shown in Table 5.

TABLE 5
Percentages predicted by PDA vs. those observed from experiment for the designed positions.

Wild Type   PDA Residues (Predicted Percentage/Observed Percentage)
 72F        Y 50/50   F 50/50
105Y        M 20/27   Q 20/18   I 20/21   N 10/7   E 10/7   S 10/10   Y 10/10
136N        D 70/72   M 20/17   N 10/11
170N        M 30/34   L 20/21   E 20/21   D 20/17   N 10/7

[0269] Note that the observed percentages of each amino acid at all four positions closely match the predicted percentages. Sequencing also revealed that only one of the 60 colonies contained a PCR error, a G to C substitution.
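The agreement reported in Table 5 can be expressed numerically. The following Python sketch is illustrative only: the dictionary layout and the function name are assumptions introduced here, and the values are the predicted/observed percentages transcribed from Table 5.

```python
# Predicted vs. observed percentages from Table 5, keyed by designed position.
# Each entry maps residue -> (predicted %, observed %).
TABLE_5 = {
    "72F": {"Y": (50, 50), "F": (50, 50)},
    "105Y": {"M": (20, 27), "Q": (20, 18), "I": (20, 21), "N": (10, 7),
             "E": (10, 7), "S": (10, 10), "Y": (10, 10)},
    "136N": {"D": (70, 72), "M": (20, 17), "N": (10, 11)},
    "170N": {"M": (30, 34), "L": (20, 21), "E": (20, 21), "D": (20, 17),
             "N": (10, 7)},
}


def deviation_summary(table):
    """Return (mean, max) absolute deviation in percentage points."""
    diffs = [abs(predicted - observed)
             for residues in table.values()
             for predicted, observed in residues.values()]
    return sum(diffs) / len(diffs), max(diffs)


mean_dev, max_dev = deviation_summary(TABLE_5)
print(f"mean deviation = {mean_dev:.1f} points, max deviation = {max_dev} points")
```

For the Table 5 values, the mean absolute deviation is 2.0 percentage points and the largest deviation is 7 points (M at position 105).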

[0270] This small test demonstrates that multiple PCR with pooled oligonucleotides can be used to construct a sequence library that reflects the desired proportions of amino acid changes.

Experimental Screening of Sequence Library

[0271] The purified PCR product containing the library of mutated sequences was then ligated into pCR-Xen1 that had previously been digested with Xho1 and HindIII and purified. The ligation reaction was transformed into competent TOP10 E. coli cells (Invitrogen). After allowing the cells to recover for 1 hour at 37° C., the cells were spread onto LB plates containing the antibiotic cefotaxime at concentrations ranging from 0.1 μg/ml to 50 μg/ml and selected for increasing resistance.

[0272] A triple mutant was found that improved enzyme function by 35-fold in only a single round of screening (see FIG. 4). This mutant (Y105Q, N136D, N170L) survived at 50 μg/ml cefotaxime.

Example 2 Secondary Library Generation of a Xylanase

PDA Pre-screening Leads to Enormous Reduction in Number of Possible Sequences

[0273] To demonstrate that computational pre-screening is feasible and will lead to a significant reduction in the number of sequences that have to be experimentally screened, initial calculations for the B. circulans xylanase with and without the substrate were performed. The PDB structure 1XNB of B. circulans xylanase and 1BCX for the enzyme substrate complex were used. Twenty-seven residues inside the binding site were visually identified as belonging to the active site. Eight of these residues were regarded as absolutely essential for the enzymatic activity. These positions were treated as wild type residues, which means that their conformation was allowed to change but not their amino acid identity (see FIG. 2).

[0274] Three of the 20 naturally occurring amino acids were not considered (cysteine, proline, and glycine). Therefore, 17 different amino acids were still possible at each of the remaining 19 positions, and the problem yields 17¹⁹ = 2.4×10²³ different amino acid sequences. This number is 10 orders of magnitude larger than what can be handled by state of the art directed evolution methods. Clearly these approaches cannot be used to screen the complete dimensionality of the problem and consider all sequences with multiple substitutions.
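The arithmetic behind this figure can be checked in a few lines. This Python sketch is illustrative; the constant names are introduced here only for readability.

```python
# Size of the unrestricted sequence space for the xylanase active site.
ALLOWED_AMINO_ACIDS = 20 - 3   # cysteine, proline and glycine are excluded
VARIABLE_POSITIONS = 27 - 8    # 27 active-site residues, 8 held at wild type

library_size = ALLOWED_AMINO_ACIDS ** VARIABLE_POSITIONS
print(f"{ALLOWED_AMINO_ACIDS}^{VARIABLE_POSITIONS} = {library_size:.1e}")
```

The result rounds to 2.4e+23, in agreement with the number of sequences stated above.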

[0275] Therefore PDA calculations were performed to reduce the search space. A list of the 10,000 lowest energy sequences was created and the probability for each amino acid at each position was determined (see Table 1).

TABLE 1
Probability of amino acids at the designed positions resulting from the PDA calculation of the wild type (WT) enzyme structure. Only amino acids with a probability greater than 1% are shown.

Position  WT  PDA Probability Distribution
  5       Y   W 37.2%  F 25.8%  Y 22.9%  H 14.0%
  7       Q   E 69.1%  L 30.2%
 11       D   I 41.2%  D 10.7%  V 10.1%  M 7.9%  L 6.4%  E 5.3%  T 4.2%  Q 3.8%  Y 2.6%  F 2.1%  N 1.9%  S 1.9%  A 1.1%
 37       V   D 29.9%  M 29.4%  V 21.4%  S 12.8%  I 4.1%  E 1.0%
 39       G   A 99.8%
 63       N   W 91.2%  Q 6.7%  A 1.4%
 65       Y   E 91.7%  L 4.9%  M 3.4%
 67       T   E 81.0%  D 12.3%  L 3.9%  A 1.7%
 71       W   V 37.8%  F 25.5%  W 8.5%  M 6.0%  D 5.8%  E 4.3%  I 1.0%
 80       Y   M 32.4%  L 31.5%  F 19.0%  I 5.9%  Y 5.7%  E 3.7%
 82       V   V 88.6%  D 11.0%
 88       Y   N 91.1%  K 6.6%  W 1.3%
110       T   D 99.9%
115       A   A 35.6%  Y 27.8%  T 14.4%  D 10.2%  S 9.2%  F 2.6%
118       E   E 92.2%  D 2.6%  I 2.0%  A 1.7%
125       F   F 79.4%  Y 11.8%  M 7.3%  L 1.5%
129       W   E 91.3%  S 8.6%
168       V   D 98.1%  A 1.0%
170       A   A 78.7%  S 17.6%  D 3.7%

[0276] If we consider all the amino acids obtained from the PDA calculation, including those with probabilities less than 1%, we obtain 4.1×10¹⁵ different amino acid sequences. This is a reduction by 7 orders of magnitude. If one only considers those amino acids that have a probability of more than 1%, as shown in Table 1 (1% criterion), the problem is decreased to 3.3×10⁹ sequences. If one neglects all amino acids with a probability of less than 5% (5% criterion), there are only 4.0×10⁶ sequences left. This is a number that can be easily handled by screening and gene shuffling techniques. Increasing the list of low energy sequences to 100,000 does not change these numbers significantly, and the effect on the amino acids obtained at each position is negligible. Changes occur only among the amino acids with a probability of less than 1%.
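The step from a list of low energy sequences to a per-position probability distribution, followed by a probability cutoff, can be sketched as follows. This is a minimal Python illustration and not the PDA implementation: the function names are assumptions, and the ten three-residue sequences are hypothetical toy data standing in for the 10,000 lowest energy sequences.

```python
from collections import Counter


def position_probabilities(sequences):
    """Per-position amino acid frequencies for equal-length sequences."""
    total = len(sequences)
    return [
        {aa: count / total for aa, count in Counter(column).items()}
        for column in zip(*sequences)
    ]


def apply_cutoff(probabilities, cutoff):
    """Keep, at each position, the residues with probability above cutoff."""
    return [
        sorted(aa for aa, p in dist.items() if p > cutoff)
        for dist in probabilities
    ]


# Hypothetical toy list; real input would be the ranked PDA output.
low_energy = ["WEI", "WED", "FEI", "YLI", "WEV",
              "FLI", "WEI", "YEI", "HEI", "WEM"]

probs = position_probabilities(low_energy)
retained = apply_cutoff(probs, cutoff=0.15)
print(retained)
```

In the toy data, residues seen in only one of the ten sequences (H at the first position; D, V and M at the third) fall below the 15% cutoff and are dropped, which mirrors how the 1% and 5% criteria shrink the library.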

[0277] Including the substrate in the PDA calculation further reduced the number of amino acids found at each position. If we consider those amino acids with a probability higher than 5%, we obtain 2.4×10⁶ sequences (see Table 2).

TABLE 2
Probability of amino acids at the designed positions resulting from the PDA calculation of the enzyme substrate complex. Only those amino acids with a probability greater than 1% are shown.

Position  WT  PDA Probability Distribution
  5       Y   Y 69.2%  W 17.0%  H 7.3%  F 6.0%
  7       Q   Q 78.1%  E 18.0%  L 3.9%
 11       D   D 97.1%
 37       V   V 50.9%  D 33.9%  S 5.4%  A 1.2%  L 1.0%
 39       G   S 80.6%  A 19.4%
 63       N   W 92.2%  D 3.9%  Q 2.9%
 65       Y   E 91.1%  L 8.7%
 67       T   E 92.8%  L 5.2%
 71       W   W 62.6%  E 13.3%  M 11.0%  S 6.9%  D 4.0%
 80       Y   M 66.4%  F 13.6%  E 10.7%  I 6.0%  L 1.3%
 82       V   V 86.0%  D 12.8%
 88       Y   W 55.1%  Y 15.9%  N 11.4%  F 9.5%  K 1.9%  Q 1.4%  D 1.4%  M 1.4%
110       T   D 99.9%
115       A   D 46.1%  S 27.8%  T 17.1%  A 7.9%
118       E   I 47.6%  D 43.0%  E 3.6%  V 2.5%  A 1.4%
125       F   Y 51.1%  F 43.3%  L 3.4%  M 2.0%
129       W   L 63.2%  M 28.1%  E 7.5%
168       V   D 98.2%
170       A   T 92.3%  A 5.9%
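The 5% figure for the enzyme substrate complex can be reproduced directly from Table 2 by multiplying, over all designed positions, the number of residues whose probability exceeds the cutoff. The Python sketch below is illustrative: the dictionary and function names are assumptions, and the values are transcribed from Table 2.

```python
from math import prod

# Probabilities (%) from Table 2, keyed by designed position.
TABLE_2 = {
    5: {"Y": 69.2, "W": 17.0, "H": 7.3, "F": 6.0},
    7: {"Q": 78.1, "E": 18.0, "L": 3.9},
    11: {"D": 97.1},
    37: {"V": 50.9, "D": 33.9, "S": 5.4, "A": 1.2, "L": 1.0},
    39: {"S": 80.6, "A": 19.4},
    63: {"W": 92.2, "D": 3.9, "Q": 2.9},
    65: {"E": 91.1, "L": 8.7},
    67: {"E": 92.8, "L": 5.2},
    71: {"W": 62.6, "E": 13.3, "M": 11.0, "S": 6.9, "D": 4.0},
    80: {"M": 66.4, "F": 13.6, "E": 10.7, "I": 6.0, "L": 1.3},
    82: {"V": 86.0, "D": 12.8},
    88: {"W": 55.1, "Y": 15.9, "N": 11.4, "F": 9.5,
         "K": 1.9, "Q": 1.4, "D": 1.4, "M": 1.4},
    110: {"D": 99.9},
    115: {"D": 46.1, "S": 27.8, "T": 17.1, "A": 7.9},
    118: {"I": 47.6, "D": 43.0, "E": 3.6, "V": 2.5, "A": 1.4},
    125: {"Y": 51.1, "F": 43.3, "L": 3.4, "M": 2.0},
    129: {"L": 63.2, "M": 28.1, "E": 7.5},
    168: {"D": 98.2},
    170: {"T": 92.3, "A": 5.9},
}


def library_size(table, cutoff_percent):
    """Product over positions of the residue count above the cutoff."""
    return prod(
        sum(1 for p in dist.values() if p > cutoff_percent)
        for dist in table.values()
    )


size_5_percent = library_size(TABLE_2, 5.0)
print(f"{size_5_percent} sequences ({size_5_percent:.1e})")
```

With the 5% criterion the product is 2,359,296, which rounds to the 2.4×10⁶ sequences stated above.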

[0278] These preliminary calculations show that PDA can significantly reduce the dimensionality of the problem and can bring it into the scope of gene shuffling and screening techniques (see FIG. 3).

We claim:
 1. A method for generating a secondary library of scaffold protein variants comprising: a) providing a primary library comprising a rank-ordered list of scaffold protein primary variant sequences; b) generating a list of primary variant positions in said primary library; c) combining a plurality of said primary variant positions to generate a secondary library of secondary sequences.
 2. A method for generating a secondary library of scaffold protein variants comprising: a) providing a primary library comprising a rank-ordered list of scaffold protein primary variant sequences; b) generating a probability distribution of amino acid residues in a plurality of variant positions; c) combining a plurality of said amino acid residues to generate a secondary library of secondary sequences.
 3. A method according to claim 1 further comprising synthesizing a plurality of said secondary sequences.
 4. A method according to claim 2 wherein said synthesizing is done by multiple PCR with pooled oligonucleotides.
 5. A method according to claim 4 wherein said pooled oligonucleotides are added in equimolar amounts.
 6. A method according to claim 4 wherein said pooled oligonucleotides are added in amounts that correspond to the frequency of the mutation.
 7. A composition comprising a plurality of secondary variant proteins comprising a subset of said secondary library.
 8. A composition comprising a plurality of nucleic acids encoding a plurality of secondary variant proteins comprising a subset of said secondary library.
 9. A method for generating a secondary library of scaffold protein variants comprising: a) providing a first library rank-ordered list of scaffold protein primary variants; b) generating a probability distribution of amino acid residues in a plurality of variant positions; c) synthesizing a plurality of scaffold protein secondary variants comprising a plurality of said amino acid residues to form a secondary library; wherein at least one of said secondary variants is different from said primary variants.